
Encyclopedia of
Data Warehousing
and Mining

John Wang
Montclair State University, USA

IDEA GROUP REFERENCE


Hershey • London • Melbourne • Singapore

Acquisitions Editor: Renée Davies
Development Editor: Kristin Roth
Senior Managing Editor: Amanda Appicello
Managing Editor: Jennifer Neidig
Copy Editors: Eva Brennan, Alana Bubnis, Renée Davies and Sue VanderHook
Typesetters: Diane Huskinson, Sara Reed and Larissa Zearfoss
Support Staff: Michelle Potter
Cover Design: Lisa Tosheff
Printed at: Yurchak Printing Inc.

Published in the United States of America by
Idea Group Reference (an imprint of Idea Group Inc.)
701 E. Chocolate Avenue, Suite 200
Hershey PA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: cust@idea-group.com
Web site: http://www.idea-group-ref.com

and in the United Kingdom by
Idea Group Reference (an imprint of Idea Group Inc.)
3 Henrietta Street
Covent Garden
London WC2E 8LU
Tel: 44 20 7240 0856
Fax: 44 20 7379 3313
Web site: http://www.eurospan.co.uk

Copyright © 2006 by Idea Group Inc. All rights reserved. No part of this publication may be reproduced, stored or distributed in
any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher.
Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or
companies does not indicate a claim of ownership by IGI of the trademark or registered trademark.

Library of Congress Cataloging-in-Publication Data

Encyclopedia of data warehousing and mining / John Wang, editor.
    p. cm.
Includes bibliographical references and index.
ISBN 1-59140-557-2 (hard cover) -- ISBN 1-59140-559-9 (ebook)
1. Data warehousing. 2. Data mining. I. Wang, John, 1955-
QA76.9.D37E52 2005
005.74--dc22
2005004522

British Cataloguing in Publication Data

A Cataloguing in Publication record for this book is available from the British Library.

All work contributed to this encyclopedia set is new, previously unpublished material. The views expressed in this encyclopedia set
are those of the authors, but not necessarily of the publisher.

Editorial Advisory Board

Floriana Esposito, Università di Bari, Italy

Chandrika Kamath, Lawrence Livermore National Laboratory, USA

Huan Liu, Arizona State University, USA

Sach Mukherjee, University of Oxford, UK

Alan Oppenheim, Montclair State University, USA

Marco Ramoni, Harvard University, USA

Mehran Sahami, Stanford University, USA

Michele Sebag, Université Paris-Sud, France

Alexander Tuzhilin, New York University, USA

Ning Zhong, Maebashi Institute of Technology, Japan

List of Contributors

Abdulghani, Amin A. / Quantiva, USA


Abidin, Taufik / North Dakota State University, USA
Abou-Seada, Magda / Middlesex University, UK
Agresti, William W. / Johns Hopkins University, USA
Amoretti, Maria Suzana Marc / Federal University of Rio Grande do Sul (UFRGS), Brazil
An, Aijun / York University, Canada
Aradhye, Hrishikesh B. / SRI International, USA
Artz, John M. / The George Washington University, USA
Ashrafi, Mafruz Zaman / Monash University, Australia
Athappilly, Kuriakose / Western Michigan University, USA
Awad, Mamoun / University of Texas at Dallas, USA
Bala, Pradip Kumar / IBAT, Deemed University, India
Banerjee, Protima / Drexel University, USA
Banerjee, Rabindra Nath / Indian Institute of Technology, Kharagpur, India
Barko, Christopher D. / The University of North Carolina at Greensboro, USA
Bashir, Ahmad / University of Texas at Dallas, USA
Becerra, Victor M. / University of Reading, UK
Bellatreche, Ladjel / LISI/ENSMA, France
Besemann, Christopher / North Dakota State University, USA
Betz, Andrew L. / Progressive Insurance, USA
Beynon, Malcolm J. / Cardiff University, UK
Bhatnagar, Shalabh / Indian Institute of Science, India
Bhowmick, Sourav Saha / Nanyang Technological University, Singapore
Bickel, Steffen / Humboldt-Universität zu Berlin, Germany
Bittanti, Sergio / Politecnico di Milano, Italy
Borges, Thyago / Catholic University of Pelotas, Brazil
Boros, Endre / RUTCOR, Rutgers University, USA
Bose, Indranil / The University of Hong Kong, Hong Kong
Boulicaut, Jean-François / INSA de Lyon, France
Breault, Joseph L. / Ochsner Clinic Foundation, USA
Bretschneider, Timo R. / Nanyang Technological University, Singapore
Brown, Marvin L. / Grambling State University, USA
Bruha, Ivan / McMaster University, Canada
Bruining, Nico / Erasmus Medical Thorax Center, The Netherlands
Buccafurri, Francesco / University Mediterranea of Reggio Calabria, Italy
Burr, Tom / Los Alamos National Laboratory, USA
Butler, Shane M. / Monash University, Australia
Cadot, Martine / University of Henri Poincaré/LORIA, Nancy, France
Caelli, Terry / Australian National University, Australia
Calvo, Roberto Wolfler / Université de Technologie de Troyes, France
Caramia, Massimiliano / Istituto per le Applicazioni del Calcolo IAC-CNR, Italy
Cardoso, Jorge / University of Madeira, Portugal
Carneiro, Sofia / University of Minho, Portugal
Cerchiello, Paola / University of Pavia, Italy
Chakravarty, Indrani / Indian Institute of Technology, India
Chalasani, Suresh / University of Wisconsin-Parkside, USA
Chang, Chia-Hui / National Central University, Taiwan
Chen, Qiyang / Montclair State University, USA
Chen, Shaokang / The University of Queensland, Australia
Chen, Yao / University of Massachusetts Lowell, USA
Chen, Zhong / Shanghai JiaoTong University, PR China
Chien, Chen-Fu / National Tsing Hua University, Taiwan
Cho, Vincent / The Hong Kong Polytechnic University, Hong Kong
Chu, Feng / Nanyang Technological University, Singapore
Chu, Wesley / University of California - Los Angeles, USA
Chung, Seokkyung / University of Southern California, USA
Chung, Soon M. / Wright State University, USA
Conversano, Claudio / University of Cassino, Italy
Cook, Diane J. / University of Texas at Arlington, USA
Cook, Jack / Rochester Institute of Technology, USA
Cunningham, Colleen / Drexel University, USA
Dai, Honghua / Deakin University, Australia
Daly, Olena / Monash University, Australia
Dardzinska, Agnieszka / Bialystok Technical University, Poland
Das, Gautam / The University of Texas at Arlington, USA
de Campos, Luis M. / Universidad de Granada, Spain
de Luigi, Fabio / University of Ferrara, Italy
De Meo, Pasquale / Università Mediterranea di Reggio Calabria, Italy
DeLorenzo, Gary J. / Robert Morris University, USA
Delve, Janet / University of Portsmouth, UK
Denoyer, Ludovic / University of Paris VI, France
Denton, Anne / North Dakota State University, USA
Dhaenens, Clarisse / LIFL, University of Lille 1, France
Diday, Edwin / University of Dauphine, France
Dillon, Tharam / University of Technology Sydney, Australia
Ding, Qiang / Concordia College, USA
Ding, Qin / Pennsylvania State University, USA
Domeniconi, Carlotta / George Mason University, USA
Dorado de la Calle, Julián / University of A Coruña, Spain
Dorai, Chitra / IBM T. J. Watson Research Center, USA
Drew, James H. / Verizon Laboratories, USA
Dumitriu, Luminita / Dunarea de Jos University, Romania
Ester, Martin / Simon Fraser University, Canada
Fan, Weiguo / Virginia Polytechnic Institute and State University, USA
Felici, Giovanni / Istituto di Analisi dei Sistemi ed Informatica (IASI-CNR), Italy
Feng, Ling / University of Twente, The Netherlands
Fernández, Víctor Fresno / Universidad Rey Juan Carlos, Spain
Fernández-Luna, Juan M. / Universidad de Granada, Spain
Fischer, Ingrid / Friedrich-Alexander University Erlangen-Nürnberg, Germany
Fu, Ada Wai-Chee / The Chinese University of Hong Kong, Hong Kong
Fu, Li M. / University of Florida, USA
Fu, Yongjian / Cleveland State University, USA
Fung, Benjamin C. M. / Simon Fraser University, Canada
Fung, Benny Yiu-ming / The Hong Kong Polytechnic University, Hong Kong
Gallinari, Patrick / University of Paris VI, France
Galvão, Roberto Kawakami Harrop / Instituto Tecnológico de Aeronáutica, Brazil
Ganguly, Auroop R. / Oak Ridge National Laboratory, USA
Garatti, Simone / Politecnico di Milano, Italy
Garrity, Edward J. / Canisius College, USA
Ge, Nanxiang / Aventis, USA
Gehrke, Johannes / Cornell University, USA
Georgieva, Olga / Institute of Control and System Research, Bulgaria
Giudici, Paolo / University of Pavia, Italy
Goodman, Kenneth W. / University of Miami, USA
Greenidge, Charles / University of the West Indies, Barbados
Grzymala-Busse, Jerzy W. / University of Kansas, USA
Gunopulos, Dimitrios / University of California, USA
Guo, Hong / Southern Illinois University, USA
Gupta, Amar / University of Arizona, USA
Gupta, P. / Indian Institute of Technology, India
Haastrup, Palle / European Commission, Italy
Hamdi, Mohamed Salah / UAE University, UAE
Hamel, Lutz / University of Rhode Island, USA
Hamers, Ronald / Erasmus Medical Thorax Center, The Netherlands
Hammer, Peter L. / RUTCOR, Rutgers University, USA
Han, Hyoil / Drexel University, USA
Harms, Sherri K. / University of Nebraska at Kearney, USA
Hogo, Mofreh / Czech Technical University, Czech Republic
Holder, Lawrence B. / University of Texas at Arlington, USA
Hong, Yu / BearingPoint Inc, USA
Horiguchi, Susumu / Tohoku University, Japan
Hou, Wen-Chi / Southern Illinois University, USA
Hsu, Chun-Nan / Institute of Information Science, Academia Sinica, Taiwan
Hsu, William H. / Kansas State University, USA
Hu, Wen-Chen / University of North Dakota, USA
Hu, Xiaohua / Drexel University, USA
Huang, Xiangji / York University, Canada
Huete, Juan F. / Universidad de Granada, Spain
Hwang, Sae / University of Texas at Arlington, USA
Ibaraki, Toshihide / Kwansei Gakuin University, Japan
Ito, Takao / Ube National College of Technology, Japan
Jahangiri, Mehrdad / University of Southern California, USA
Järvelin, Kalervo / University of Tampere, Finland
Jha, Neha / Indian Institute of Technology, Kharagpur, India
Jin, Haihao / University of Kentucky, USA
Jourdan, Laetitia / LIFL, University of Lille 1, France
Jun, Jongeun / University of Southern California, USA
Kanapady, Ramdev / University of Minnesota, USA
Kao, Odej / University of Paderborn, Germany
Karakaya, Murat / Bilkent University, Turkey
Katsaros, Dimitrios / Aristotle University, Greece
Kern-Isberner, Gabriele / University of Dortmund, Germany
Khan, Latifur / University of Texas at Dallas, USA
Khan, M. Riaz / University of Massachusetts Lowell, USA
Khan, Shiraj / University of South Florida, USA
Kickhöfel, Rodrigo Branco / Catholic University of Pelotas, Brazil
Kim, Han-Joon / The University of Seoul, Korea
Klawonn, Frank / University of Applied Sciences Braunschweig/Wolfenbuettel, Germany
Koeller, Andreas / Montclair State University, USA
Kokol, Peter / University of Maribor, FERI, Slovenia
Kontio, Juha / Turku Polytechnic, Finland
Koppelaar, Henk / Delft University of Technology, The Netherlands
Kroeze, Jan H. / University of Pretoria, South Africa
Kros, John F. / East Carolina University, USA
Kryszkiewicz, Marzena / Warsaw University of Technology, Poland
Kusiak, Andrew / The University of Iowa, USA
la Tendresse, Ingo / Technical University of Clausthal, Germany
Lax, Gianluca / University Mediterranea of Reggio Calabria, Italy
Layos, Luis Magdalena / Universidad Politécnica de Madrid, Spain
Lazarevic, Aleksandar / University of Minnesota, USA
Lee, Chung-Hong / National Kaohsiung University of Applied Sciences, Taiwan
Lee, Chung-wei / Auburn University, USA
Lee, JeongKyu / University of Texas at Arlington, USA
Lee, Tzai-Zang / National Cheng Kung University, Taiwan, ROC
Lee, Zu-Hsu / Montclair State University, USA
Lee-Post, Anita / University of Kentucky, USA
Lenič, Mitja / University of Maribor, FERI, Slovenia
Levary, Reuven R. / Saint Louis University, USA
Li, Tao / Florida International University, USA
Li, Wenyuan / Nanyang Technological University, Singapore
Liberati, Diego / Consiglio Nazionale delle Ricerche, Italy
Licthnow, Daniel / Catholic University of Pelotas, Brazil
Lim, Ee-Peng / Nanyang Technological University, Singapore
Lin, Beixin (Betsy) / Montclair State University, USA
Lin, Tsau Young / San Jose State University, USA
Lindell, Yehuda / Bar-Ilan University, Israel
Lingras, Pawan / Saint Mary's University, Canada
Liu, Chang / Northern Illinois University, USA
Liu, Huan / Arizona State University, USA
Liu, Li / Aventis, USA
Liu, Xiaohui / Brunel University, UK
Liu, Xiaoqiang / Delft University of Technology, The Netherlands, and Donghua University, China
Lo, Victor S.Y. / Fidelity Personal Investments, USA
Lodhi, Huma / Imperial College London, UK
Loh, Stanley / Catholic University of Pelotas, Brazil, and Lutheran University of Brazil, Brazil
Long, Lori K. / Kent State University, USA
Lorenzi, Fabiana / Universidade Luterana do Brasil, Brazil
Lovell, Brian C. / The University of Queensland, Australia
Lu, June / University of Houston-Victoria, USA
Lu, Xinjian / California State University, Hayward, USA
Lutu, Patricia E.N. / University of Pretoria, South Africa
Ma, Sheng / IBM T.J. Watson Research Center, USA
Maj, Jean-Baptiste / LORIA/INRIA, France
Maloof, Marcus A. / Georgetown University, USA
Mangamuri, Murali / Wright State University, USA
Mani, D. R. / Massachusetts Institute of Technology, USA, and Harvard University, USA
Maniezzo, Vittorio / University of Bologna, Italy
Manolopoulos, Yannis / Aristotle University, Greece
Marchetti, Carlo / Università di Roma La Sapienza, Italy
Martí, Rafael / Universitat de València, Spain
Masseglia, Florent / INRIA Sophia Antipolis, France
Mathieu, Richard / Saint Louis University, USA
McLeod, Dennis / University of Southern California, USA
Mecella, Massimo / Università di Roma La Sapienza, Italy
Meinl, Thorsten / Friedrich-Alexander University Erlangen-Nürnberg, Germany
Meo, Rosa / Università degli Studi di Torino, Italy
Mishra, Nilesh / Indian Institute of Technology, India
Mladenić, Dunja / Jozef Stefan Institute, Slovenia
Mobasher, Bamshad / DePaul University, USA
Mohania, Mukesh / IBM India Research Lab, India
Morantz, Brad / Georgia State University, USA
Moreira, Adriano / University of Minho, Portugal
Motiwalla, Luvai / University of Massachusetts Lowell, USA
Muhlenbach, Fabrice / EURISE, Université Jean Monnet - Saint-Etienne, France
Mukherjee, Sach / University of Oxford, UK
Murty, M. Narasimha / Indian Institute of Science, India
Muruzábal, Jorge / University Rey Juan Carlos, Spain
Muselli, Marco / Italian National Research Council, Italy
Musicant, David R. / Carleton College, USA
Muslea, Ion / SRI International, USA
Nanopoulos, Alexandros / Aristotle University, Greece
Nasraoui, Olfa / University of Louisville, USA
Nayak, Richi / Queensland University of Technology, Australia
Nemati, Hamid R. / The University of North Carolina at Greensboro, USA
Ng, Vincent To-yee / The Hong Kong Polytechnic University, Hong Kong
Ng, Wee-Keong / Nanyang Technological University, Singapore
Nicholson, Scott / Syracuse University School of Information Studies, USA
O'Donnell, Joseph B. / Canisius College, USA
Oh, JungHwan / University of Texas at Arlington, USA
Oppenheim, Alan / Montclair State University, USA
Owens, Jan / University of Wisconsin-Parkside, USA
Oza, Nikunj C. / NASA Ames Research Center, USA
Pang, Les / National Defense University, USA
Paquet, Eric / National Research Council of Canada, Canada
Pasquier, Nicolas / Université de Nice-Sophia Antipolis, France
Pathak, Praveen / University of Florida, USA
Perlich, Claudia / IBM Research, USA
Perrizo, William / North Dakota State University, USA
Peter, Hadrian / University of the West Indies, Barbados
Peterson, Richard L. / Montclair State University, USA
Pharo, Nils / Oslo University College, Norway
Piltcher, Gustavo / Catholic University of Pelotas, Brazil
Poncelet, Pascal / Ecole des Mines d'Alès, France
Portougal, Victor / The University of Auckland, New Zealand
Povalej, Petra / University of Maribor, FERI, Slovenia
Primo, Tiago / Catholic University of Pelotas, Brazil
Provost, Foster / New York University, USA
Psaila, Giuseppe / Università degli Studi di Bergamo, Italy
Quattrone, Giovanni / Università Mediterranea di Reggio Calabria, Italy
Rabuñal Dopico, Juan R. / University of A Coruña, Spain
Rahman, Hakikur / SDNP, Bangladesh
Rakotomalala, Ricco / ERIC, Université Lumière Lyon 2, France
Ramoni, Marco F. / Harvard Medical School, USA
Ras, Zbigniew W. / University of North Carolina, Charlotte, USA
Rea, Alan / Western Michigan University, USA
Rehm, Frank / German Aerospace Center, Germany
Ricci, Francesco / eCommerce and Tourism Research Laboratory, ITC-irst, Italy
Rivero Cebrián, Daniel / University of A Coruña, Spain
Sacharidis, Dimitris / University of Southern California, USA
Saldaña, Ramiro / Catholic University of Pelotas, Brazil
Sanders, G. Lawrence / State University of New York at Buffalo, USA
Santos, Maribel Yasmina / University of Minho, Portugal
Saquer, Jamil M. / Southwest Missouri State University, USA
Sayal, Mehmet / Hewlett-Packard Labs, USA
Saygin, Yücel / Sabanci University, Turkey
Scannapieco, Monica / Università di Roma La Sapienza, Italy
Schafer, J. Ben / University of Northern Iowa, USA
Scheffer, Tobias / Humboldt-Universität zu Berlin, Germany
Schneider, Michel / Blaise Pascal University, France
Scime, Anthony / State University of New York College at Brockport, USA
Sebastiani, Paola / Boston University School of Public Health, USA
Segall, Richard S. / Arkansas State University, USA
Shah, Shital C. / The University of Iowa, USA
Shahabi, Cyrus / University of Southern California, USA
Shen, Hong / Japan Advanced Institute of Science and Technology, Japan
Sheng, Yihua Philip / Southern Illinois University, USA
Siciliano, Roberta / University of Naples Federico II, Italy
Simitsis, Alkis / National Technical University of Athens, Greece
Simões, Gabriel / Catholic University of Pelotas, Brazil
Sindoni, Giuseppe / ISTAT - National Institute of Statistics, Italy
Singh, Richa / Indian Institute of Technology, India
Smets, Philippe / Université Libre de Bruxelles, Belgium
Smith, Kate A. / Monash University, Australia
Song, Il-Yeol / Drexel University, USA
Song, Min / Drexel University, USA
Sounderpandian, Jayavel / University of Wisconsin-Parkside, USA
Souto, Nieves Pedreira / University of A Corua, Spain
Stanton, Jeffrey / Syracuse University School of Information Studies, USA
Sundaram, David / The University of Auckland, New Zealand
Sural, Shamik / Indian Institute of Technology, Kharagpur, India
Talbi, El-Ghazali / LIFL, University of Lille 1, France
Tan, Hee Beng Kuan / Nanyang Technological University, Singapore
Tan, Rebecca Boon-Noi / Monash University, Australia
Taniar, David / Monash University, Australia
Teisseire, Maguelonne / University of Montpellier II, France
Terracina, Giorgio / Università della Calabria, Italy
Thelwall, Mike / University of Wolverhampton, UK
Theodoratos, Dimitri / New Jersey Institute of Technology, USA
Thomasian, Alexander / New Jersey Institute of Technology, USA
Thuraisingham, Bhavani / The MITRE Corporation, USA
Tininini, Leonardo / CNR - Istituto di Analisi dei Sistemi e Informatica Antonio Ruberti, Italy
Troutt, Marvin D. / Kent State University, USA
Truemper, Klaus / University of Texas at Dallas, USA
Tsay, Li-Shiang / University of North Carolina, Charlotte, USA
Tzacheva, Angelina / University of North Carolina, Charlotte, USA
Ulusoy, Özgür / Bilkent University, Turkey
Ursino, Domenico / Università Mediterranea di Reggio Calabria, Italy
Vardaki, Maria / University of Athens, Greece
Vargas, Juan E. / University of South Carolina, USA
Vatsa, Mayank / Indian Institute of Technology, India
Viertl, Reinhard / Vienna University of Technology, Austria
Viktor, Herna L. / University of Ottawa, Canada
Virgillito, Antonino / Università di Roma La Sapienza, Italy
Viswanath, P. / Indian Institute of Science, India
Walter, Jörg Andreas / University of Bielefeld, Germany
Wang, Dajin / Montclair State University, USA
Wang, Hai / Saint Mary's University, Canada
Wang, Ke / Simon Fraser University, Canada
Wang, Lipo / Nanyang Technological University, Singapore
Wang, Shouhong / University of Massachusetts Dartmouth, USA
Wang, Xiong / California State University at Fullerton, USA
Webb, Geoffrey I. / Monash University, Australia
Wen, Ji-Rong / Microsoft Research Asia, China
West, Chad / IBM Canada Limited, Canada
Wickramasinghe, Nilmini / Cleveland State University, USA
Wieczorkowska, Alicja A. / Polish-Japanese Institute of Information Technology, Poland
Winkler, William E. / U.S. Bureau of the Census, USA
Wong, Raymond Chi-Wing / The Chinese University of Hong Kong, Hong Kong
Woon, Yew-Kwong / Nanyang Technological University, Singapore
Wu, Chien-Hsing / National University of Kaohsiung, Taiwan, ROC
Xiang, Yang / University of Guelph, Canada
Xing, Ruben / Montclair State University, USA
Yan, Feng / Williams Power, USA
Yan, Rui / Saint Mary's University, Canada
Yang, Hsin-Chang / Chang Jung University, Taiwan
Yang, Hung-Jen / National Kaohsiung Normal University, Taiwan
Yang, Ying / Monash University, Australia
Yao, James E. / Montclair State University, USA
Yao, Yiyu / University of Regina, Canada
Yavas, Gökhan / Bilkent University, Turkey
Yeh, Jyh-haw / Boise State University, USA
Yoo, Illhoi / Drexel University, USA
Yu, Lei / Arizona State University, USA
Zendulka, Jaroslav / Brno University of Technology, Czech Republic
Zhang, Bin / Hewlett-Packard Research Laboratories, USA
Zhang, Chengqi / University of Technology Sydney, Australia
Zhang, Shichao / University of Technology Sydney, Australia
Zhang, Yu-Jin / Tsinghua University, Beijing, China
Zhao, Qiankun / Nanyang Technological University, Singapore
Zhao, Yan / University of Regina, Canada
Zhao, Yuan / Nanyang Technological University, Singapore
Zhou, Senqiang / Simon Fraser University, Canada
Zhou, Zhi-Hua / Nanjing University, China
Zhu, Dan / Iowa State University, USA
Zhu, Qiang / University of Michigan, USA
Ziadé, Tarek / NUXEO, France
Ziarko, Wojciech / University of Regina, Canada
Zorman, Milan / University of Maribor, FERI, Slovenia
Zou, Qinghua / University of California - Los Angeles, USA

Contents
by Volume

VOLUME I

Action Rules / Zbigniew W. Ras, Angelina Tzacheva, and Li-Shiang Tsay ........................................................... 1

Active Disks for Data Mining / Alexander Thomasian ........................................................................................... 6

Active Learning with Multiple Views / Ion Muslea ................................................................................................. 12

Administering and Managing a Data Warehouse / James E. Yao, Chang Liu, Qiyang Chen, and June Lu .......... 17

Agent-Based Mining of User Profiles for E-Services / Pasquale De Meo, Giovanni Quattrone,
Giorgio Terracina, and Domenico Ursino ......................................................................................................... 23

Aggregate Query Rewriting in Multidimensional Databases / Leonardo Tininini ................................................. 28

Aggregation for Predictive Modeling with Relational Data / Claudia Perlich and Foster Provost ....................... 33

API Standardization Efforts for Data Mining / Jaroslav Zendulka ......................................................................... 39

Application of Data Mining to Recommender Systems, The / J. Ben Schafer ........................................................ 44

Approximate Range Queries by Histograms in OLAP / Francesco Buccafurri and Gianluca Lax ........................ 49

Artificial Neural Networks for Prediction / Rafael Martí ......................................................................................... 54

Association Rule Mining / Yew-Kwong Woon, Wee-Keong Ng, and Ee-Peng Lim ................................................ 59

Association Rule Mining and Application to MPIS / Raymond Chi-Wing Wong and Ada Wai-Chee Fu ............. 65

Association Rule Mining of Relational Data / Anne Denton and Christopher Besemann ..................................... 70

Association Rules and Statistics / Martine Cadot, Jean-Baptiste Maj, and Tarek Ziadé ..................................... 74

Automated Anomaly Detection / Brad Morantz ..................................................................................................... 78

Automatic Musical Instrument Sound Classification / Alicja A. Wieczorkowska ................................................... 83

Bayesian Networks / Ahmad Bashir, Latifur Khan, and Mamoun Awad ............................................................... 89

Best Practices in Data Warehousing from the Federal Perspective / Les Pang ....................................................... 94

Bibliomining for Library Decision-Making / Scott Nicholson and Jeffrey Stanton ................................................. 100

Biomedical Data Mining Using RBF Neural Networks / Feng Chu and Lipo Wang ................................................ 106

Building Empirical-Based Knowledge for Design Recovery / Hee Beng Kuan Tan and Yuan Zhao ...................... 112

Business Processes / David Sundaram and Victor Portougal ............................................................................... 118

Case-Based Recommender Systems / Fabiana Lorenzi and Francesco Ricci ....................................................... 124

Categorization Process and Data Mining / Maria Suzana Marc Amoretti ............................................................. 129

Center-Based Clustering and Regression Clustering / Bin Zhang ........................................................................... 134

Classification and Regression Trees / Johannes Gehrke ........................................................................................ 141

Classification Methods / Aijun An ........................................................................................................................... 144

Closed-Itemset Incremental-Mining Problem / Luminita Dumitriu ......................................................................... 150

Cluster Analysis in Fitting Mixtures of Curves / Tom Burr ..................................................................................... 154

Clustering Analysis and Algorithms / Xiangji Huang ............................................................................................. 159

Clustering in the Identification of Space Models / Maribel Yasmina Santos, Adriano Moreira,
and Sofia Carneiro .............................................................................................................................................. 165

Clustering of Time Series Data / Anne Denton ......................................................................................................... 172

Clustering Techniques / Sheng Ma and Tao Li ....................................................................................................... 176

Clustering Techniques for Outlier Detection / Frank Klawonn and Frank Rehm ................................................. 180

Combining Induction Methods with the Multimethod Approach / Mitja Lenič, Peter Kokol, Petra Povalej,
and Milan Zorman ............................................................................................................................................... 184

Comprehensibility of Data Mining Algorithms / Zhi-Hua Zhou .............................................................................. 190

Computation of OLAP Cubes / Amin A. Abdulghani .............................................................................................. 196

Concept Drift / Marcus A. Maloof ............................................................................................................................ 202

Condensed Representations for Data Mining / Jean-François Boulicaut ............................................................. 207

Content-Based Image Retrieval / Timo R. Bretschneider and Odej Kao ................................................................. 212

Continuous Auditing and Data Mining / Edward J. Garrity, Joseph B. O'Donnell,
and G. Lawrence Sanders ................................................................................................................................... 217

Data Driven vs. Metric Driven Data Warehouse Design / John M. Artz ................................................................. 223

Data Management in Three-Dimensional Structures / Xiong Wang ........................................................................ 228

Data Mining and Decision Support for Business and Science / Auroop R. Ganguly, Amar Gupta,
and Shiraj Khan .................................................................................................................................................. 233

Data Mining and Warehousing in Pharma Industry / Andrew Kusiak and Shital C. Shah .................................... 239

Data Mining for Damage Detection in Engineering Structures / Ramdev Kanapady
and Aleksandar Lazarevic .................................................................................................................................. 245

Data Mining for Intrusion Detection / Aleksandar Lazarevic ................................................................................. 251

Data Mining in Diabetes Diagnosis and Detection / Indranil Bose ........................................................................ 257

Data Mining in Human Resources / Marvin D. Troutt and Lori K. Long ............................................................... 262

Data Mining in the Federal Government / Les Pang ................................................................................................ 268

Data Mining in the Soft Computing Paradigm / Pradip Kumar Bala, Shamik Sural,
and Rabindra Nath Banerjee .............................................................................................................................. 272

Data Mining Medical Digital Libraries / Colleen Cunningham and Xiaohua Hu ................................................... 278

Data Mining Methods for Microarray Data Analysis / Lei Yu and Huan Liu ......................................................... 283

Data Mining with Cubegrades / Amin A. Abdulghani ............................................................................................. 288

Data Mining with Incomplete Data / Hai Wang and Shouhong Wang .................................................................... 293

Data Quality in Cooperative Information Systems / Carlo Marchetti, Massimo Mecella, Monica Scannapieco,
and Antonino Virgillito ...................................................................................................................................... 297

Data Quality in Data Warehouses / William E. Winkler .......................................................................................... 302

Data Reduction and Compression in Database Systems / Alexander Thomasian .................................................. 307

Data Warehouse Back-End Tools / Alkis Simitsis and Dimitri Theodoratos ......................................................... 312

Data Warehouse Performance / Beixin (Betsy) Lin, Yu Hong, and Zu-Hsu Lee ..................................................... 318

Data Warehousing and Mining in Supply Chains / Richard Mathieu and Reuven R. Levary ............................... 323

Data Warehousing Search Engine / Hadrian Peter and Charles Greenidge ......................................................... 328

Data Warehousing Solutions for Reporting Problems / Juha Kontio ..................................................................... 334

Database Queries, Data Mining, and OLAP / Lutz Hamel ....................................................................................... 339

Database Sampling for Data Mining / Patricia E.N. Lutu ....................................................................................... 344

DEA Evaluation of Performance of E-Business Initiatives / Yao Chen, Luvai Motiwalla, and M. Riaz Khan ...... 349

Decision Tree Induction / Roberta Siciliano and Claudio Conversano ............................................................... 353

Diabetic Data Warehouses / Joseph L. Breault ....................................................................................................... 359

Discovering an Effective Measure in Data Mining / Takao Ito ............................................................................... 364

Discovering Knowledge from XML Documents / Richi Nayak .............................................................................. 372

Discovering Ranking Functions for Information Retrieval / Weiguo Fan and Praveen Pathak ............................ 377

Discovering Unknown Patterns in Free Text / Jan H. Kroeze ................................................................................. 382

Discovery Informatics / William W. Agresti ............................................................................................................. 387

Discretization for Data Mining / Ying Yang and Geoffrey I. Webb .......................................................................... 392

Discretization of Continuous Attributes / Fabrice Muhlenbach and Ricco Rakotomalala .................................. 397

Distributed Association Rule Mining / Mafruz Zaman Ashrafi, David Taniar, and Kate A. Smith ........................ 403

Distributed Data Management of Daily Car Pooling Problems / Roberto Wolfler Calvo, Fabio de Luigi,
Palle Haastrup, and Vittorio Maniezzo ............................................................................................................. 408

Drawing Representative Samples from Large Databases / Wen-Chi Hou, Hong Guo, Feng Yan,
and Qiang Zhu ..................................................................................................................................................... 413

Efficient Computation of Data Cubes and Aggregate Views / Leonardo Tininini .................................................. 421

Embedding Bayesian Networks in Sensor Grids / Juan E. Vargas .......................................................................... 427

Employing Neural Networks in Data Mining / Mohamed Salah Hamdi .................................................................. 433

Enhancing Web Search through Query Log Mining / Ji-Rong Wen ....................................................................... 438

Enhancing Web Search through Web Structure Mining / Ji-Rong Wen ................................................................. 443

Ensemble Data Mining Methods / Nikunj C. Oza ................................................................................................... 448

Ethics of Data Mining / Jack Cook ......................................................................................................................... 454

Ethnography to Define Requirements and Data Model / Gary J. DeLorenzo ......................................................... 459

Evaluation of Data Mining Methods / Paolo Giudici ............................................................................................. 464

Evolution of Data Cube Computational Approaches / Rebecca Boon-Noi Tan ...................................................... 469

Evolutionary Computation and Genetic Algorithms / William H. Hsu ..................................................................... 477

Evolutionary Data Mining For Genomics / Laetitia Jourdan, Clarisse Dhaenens, and El-Ghazali Talbi ............ 482

Evolutionary Mining of Rule Ensembles / Jorge Muruzábal .................................................................................. 487

Explanation-Oriented Data Mining / Yiyu Yao and Yan Zhao ................................................................................. 492

Factor Analysis in Data Mining / Zu-Hsu Lee, Richard L. Peterson, Chen-Fu Chien, and Ruben Xing ............... 498

Financial Ratio Selection for Distress Classification / Roberto Kawakami Harrop Galvão, Victor M. Becerra,
and Magda Abou-Seada ..................................................................................................................................... 503

Flexible Mining of Association Rules / Hong Shen ................................................................................................. 509

Formal Concept Analysis Based Clustering / Jamil M. Saquer .............................................................................. 514

Fuzzy Information and Data Analysis / Reinhard Viertl ......................................................................................... 519

General Model for Data Warehouses, A / Michel Schneider .................................................................................. 523

Genetic Programming / William H. Hsu .................................................................................................................... 529

Graph Transformations and Neural Networks / Ingrid Fischer ............................................................................... 534

Graph-Based Data Mining / Lawrence B. Holder and Diane J. Cook .................................................................... 540

Group Pattern Discovery Systems for Multiple Data Sources / Shichao Zhang and Chengqi Zhang ................... 546

Heterogeneous Gene Data for Classifying Tumors / Benny Yiu-ming Fung and Vincent To-yee Ng .................... 550

Hierarchical Document Clustering / Benjamin C. M. Fung, Ke Wang, and Martin Ester ....................................... 555

High Frequency Patterns in Data Mining / Tsau Young Lin .................................................................................... 560

Homeland Security Data Mining and Link Analysis / Bhavani Thuraisingham ..................................................... 566

Humanities Data Warehousing / Janet Delve .......................................................................................................... 570

Hyperbolic Space for Interactive Visualization / Jörg Andreas Walter ................................................................... 575

VOLUME II

Identifying Single Clusters in Large Data Sets / Frank Klawonn and Olga Georgieva ......................................... 582

Immersive Image Mining in Cardiology / Xiaoqiang Liu, Henk Koppelaar, Ronald Hamers,
and Nico Bruining ............................................................................................................................................... 586

Imprecise Data and the Data Mining Process / Marvin L. Brown and John F. Kros .............................................. 593

Incorporating the People Perspective into Data Mining / Nilmini Wickramasinghe .............................................. 599

Incremental Mining from News Streams / Seokkyung Chung, Jongeun Jun and Dennis McLeod ........................ 606

Inexact Field Learning Approach for Data Mining / Honghua Dai ......................................................................... 611

Information Extraction in Biomedical Literature / Min Song, Il-Yeol Song, Xiaohua Hu, and Hyoil Han .............. 615

Instance Selection / Huan Liu and Lei Yu ............................................................................................................... 621

Integration of Data Sources through Data Mining / Andreas Koeller .................................................................... 625

Intelligence Density / David Sundaram and Victor Portougal .............................................................................. 630

Intelligent Data Analysis / Xiaohui Liu ................................................................................................................... 634

Intelligent Query Answering / Zbigniew W. Ras and Agnieszka Dardzinska ........................................................ 639

Interactive Visual Data Mining / Shouhong Wang and Hai Wang .......................................................................... 644

Interscheme Properties' Role in Data Warehouses / Pasquale De Meo, Giorgio Terracina,
and Domenico Ursino .......................................................................................................................................... 647

Inter-Transactional Association Analysis for Prediction / Ling Feng and Tharam Dillon .................................... 653

Interval Set Representations of Clusters / Pawan Lingras, Rui Yan, Mofreh Hogo, and Chad West .................... 659

Kernel Methods in Chemoinformatics / Huma Lodhi .............................................................................................. 664

Knowledge Discovery with Artificial Neural Networks / Juan R. Rabuñal Dopico, Daniel Rivero Cebrián,
Julián Dorado de la Calle, and Nieves Pedreira Souto .................................................................................... 669

Learning Bayesian Networks / Marco F. Ramoni and Paola Sebastiani ............................................................... 674

Learning Information Extraction Rules for Web Data Mining / Chia-Hui Chang and Chun-Nan Hsu .................. 678

Locally Adaptive Techniques for Pattern Classification / Carlotta Domeniconi and Dimitrios Gunopulos ........ 684

Logical Analysis of Data / Endre Boros, Peter L. Hammer, and Toshihide Ibaraki .............................................. 689

Lsquare System for Mining Logic Data, The / Giovanni Felici and Klaus Truemper ............................................ 693

Marketing Data Mining / Victor S.Y. Lo .................................................................................................................. 698

Material Acquisitions Using Discovery Informatics Approach / Chien-Hsing Wu and Tzai-Zang Lee ................. 705

Materialized Hypertext View Maintenance / Giuseppe Sindoni .............................................................................. 710

Materialized Hypertext Views / Giuseppe Sindoni ................................................................................................... 714

Materialized View Selection for Data Warehouse Design / Dimitri Theodoratos and Alkis Simitsis ..................... 717

Methods for Choosing Clusters in Phylogenetic Trees / Tom Burr ........................................................................ 722

Microarray Data Mining / Li M. Fu .......................................................................................................................... 728

Microarray Databases for Biotechnology / Richard S. Segall ................................................................................ 734

Mine Rule / Rosa Meo and Giuseppe Psaila ........................................................................................................... 740

Mining Association Rules on a NCR Teradata System / Soon M. Chung and Murali Mangamuri ....................... 746

Mining Association Rules Using Frequent Closed Itemsets / Nicolas Pasquier ................................................... 752

Mining Chat Discussions / Stanley Loh, Daniel Licthnow, Thyago Borges, Tiago Primo,
Rodrigo Branco Kickhöfel, Gabriel Simões, Gustavo Piltcher, and Ramiro Saldaña ..................................... 758

Mining Data with Group Theoretical Means / Gabriele Kern-Isberner .................................................................. 763

Mining E-Mail Data / Steffen Bickel and Tobias Scheffer ....................................................................................... 768

Mining for Image Classification Based on Feature Elements / Yu-Jin Zhang ......................................................... 773

Mining for Profitable Patterns in the Stock Market / Yihua Philip Sheng, Wen-Chi Hou, and Zhong Chen ......... 779

Mining for Web-Enabled E-Business Applications / Richi Nayak ......................................................................... 785

Mining Frequent Patterns via Pattern Decomposition / Qinghua Zou and Wesley Chu ......................................... 790

Mining Group Differences / Shane M. Butler and Geoffrey I. Webb ....................................................................... 795

Mining Historical XML / Qiankun Zhao and Sourav Saha Bhowmick .................................................................. 800

Mining Images for Structure / Terry Caelli ............................................................................................................. 805

Mining Microarray Data / Nanxiang Ge and Li Liu ................................................................................................ 810

Mining Quantitative and Fuzzy Association Rules / Hong Shen and Susumu Horiguchi ..................................... 815

Model Identification through Data Mining / Diego Liberati .................................................................................. 820

Modeling Web-Based Data in a Data Warehouse / Hadrian Peter and Charles Greenidge ................................. 826

Moral Foundations of Data Mining / Kenneth W. Goodman ................................................................................... 832

Mosaic-Based Relevance Feedback for Image Retrieval / Odej Kao and Ingo la Tendresse ................................. 837

Multimodal Analysis in Multimedia Using Symbolic Kernels / Hrishikesh B. Aradhye and Chitra Dorai ............ 842

Multiple Hypothesis Testing for Data Mining / Sach Mukherjee ........................................................................... 848

Music Information Retrieval / Alicja A. Wieczorkowska ......................................................................................... 854

Negative Association Rules in Data Mining / Olena Daly and David Taniar ....................................................... 859

Neural Networks for Prediction and Classification / Kate A. Smith ......................................................................... 865

Off-Line Signature Recognition / Indrani Chakravarty, Nilesh Mishra, Mayank Vatsa, Richa Singh,
and P. Gupta ........................................................................................................................................................ 870

Online Analytical Processing Systems / Rebecca Boon-Noi Tan ........................................................................... 876

Online Signature Recognition / Indrani Chakravarty, Nilesh Mishra, Mayank Vatsa, Richa Singh,
and P. Gupta ........................................................................................................................................................ 885

Organizational Data Mining / Hamid R. Nemati and Christopher D. Barko .......................................................... 891

Path Mining in Web Processes Using Profiles / Jorge Cardoso ............................................................................. 896

Pattern Synthesis for Large-Scale Pattern Recognition / P. Viswanath, M. Narasimha Murty,
and Shalabh Bhatnagar ..................................................................................................................................... 902

Physical Data Warehousing Design / Ladjel Bellatreche and Mukesh Mohania .................................................. 906

Predicting Resource Usage for Capital Efficient Marketing / D. R. Mani, Andrew L. Betz, and James H. Drew .... 912

Privacy and Confidentiality Issues in Data Mining / Yücel Saygin ......................................................................... 921

Privacy Protection in Association Rule Mining / Neha Jha and Shamik Sural ..................................................... 925

Profit Mining / Senqiang Zhou and Ke Wang ......................................................................................................... 930

Pseudo Independent Models / Yang Xiang ............................................................................................................ 935

Reasoning about Frequent Patterns with Negation / Marzena Kryszkiewicz ......................................................... 941

Recovery of Data Dependencies / Hee Beng Kuan Tan and Yuan Zhao ................................................................ 947

Reinforcing CRM with Data Mining / Dan Zhu ........................................................................................................ 950

Resource Allocation in Wireless Networks / Dimitrios Katsaros, Gökhan Yavas, Alexandros Nanopoulos,
Murat Karakaya, Özgür Ulusoy, and Yannis Manolopoulos ............................................................................ 955

Retrieving Medical Records Using Bayesian Networks / Luis M. de Campos, Juan M. Fernández-Luna,
and Juan F. Huete ............................................................................................................................................... 960

Robust Face Recognition for Data Mining / Brian C. Lovell and Shaokang Chen ............................................... 965

Rough Sets and Data Mining / Jerzy W. Grzymala-Busse and Wojciech Ziarko .................................................... 973

Rule Generation Methods Based on Logic Synthesis / Marco Muselli .................................................................. 978

Rule Qualities and Knowledge Combination for Decision-Making / Ivan Bruha .................................................... 984

Sampling Methods in Approximate Query Answering Systems / Gautam Das ...................................................... 990

Scientific Web Intelligence / Mike Thelwall ............................................................................................................ 995

Search Situations and Transitions / Nils Pharo and Kalervo Järvelin .................................................................. 1000

Secure Multiparty Computation for Privacy Preserving Data Mining / Yehuda Lindell .......................................... 1005

Semantic Data Mining / Protima Banerjee, Xiaohua Hu, and Illhoi Yoo ............................................................... 1010

Semi-Structured Document Classification / Ludovic Denoyer and Patrick Gallinari ............................................ 1015

Semi-Supervised Learning / Tobias Scheffer ............................................................................................................ 1022

Sequential Pattern Mining / Florent Masseglia, Maguelonne Teisseire, and Pascal Poncelet ............................ 1028

Software Warehouse / Honghua Dai ....................................................................................................................... 1033

Spectral Methods for Data Clustering / Wenyuan Li ............................................................................................... 1037

Statistical Data Editing / Claudio Conversano and Roberta Siciliano .................................................................. 1043

Statistical Metadata in Data Processing and Interchange / Maria Vardaki ........................................................... 1048

Storage Strategies in Data Warehouses / Xinjian Lu .............................................................................................. 1054

Subgraph Mining / Ingrid Fischer and Thorsten Meinl .......................................................................................... 1059

Support Vector Machines / Mamoun Awad and Latifur Khan ............................................................................... 1064

Support Vector Machines Illuminated / David R. Musicant .................................................................................... 1071

Survival Analysis and Data Mining / Qiyang Chen, Alan Oppenheim, and Dajin Wang ...................................... 1077

Symbiotic Data Mining / Kuriakose Athappilly and Alan Rea .............................................................................. 1083

Symbolic Data Clustering / Edwin Diday and M. Narasimha Murty .................................................................... 1087

Synthesis with Data Warehouse Applications and Utilities / Hakikur Rahman .................................................... 1092

Temporal Association Rule Mining in Event Sequences / Sherri K. Harms ........................................................... 1098

Text Content Approaches in Web Content Mining / Víctor Fresno Fernández and Luis Magdalena Layos ....... 1103

Text Mining-Machine Learning on Documents / Dunja Mladenić ........................................................................ 1109

Text Mining Methods for Hierarchical Document Indexing / Han-Joon Kim ......................................................... 1113

Time Series Analysis and Mining Techniques / Mehmet Sayal .............................................................................. 1120

Time Series Data Forecasting / Vincent Cho ........................................................................................................... 1125

Topic Maps Generation by Text Mining / Hsin-Chang Yang and Chung-Hong Lee ............................................. 1130

Transferable Belief Model / Philippe Smets ............................................................................................................ 1135

Tree and Graph Mining / Dimitrios Katsaros and Yannis Manolopoulos ............................................................. 1140

Trends in Web Content and Structure Mining / Anita Lee-Post and Haihao Jin .................................................. 1146

Trends in Web Usage Mining / Anita Lee-Post and Haihao Jin ............................................................................ 1151

Unsupervised Mining of Genes Classifying Leukemia / Diego Liberati, Sergio Bittanti,
and Simone Garatti ............................................................................................................................................. 1155

Use of RFID in Supply Chain Data Processing / Jan Owens, Suresh Chalasani,
and Jayavel Sounderpandian ............................................................................................................................. 1160

Using Dempster-Shafer Theory in Data Mining / Malcolm J. Beynon .................................................................... 1166

Using Standard APIs for Data Mining in Prediction / Jaroslav Zendulka .............................................................. 1171

Utilizing Fuzzy Decision Trees in Decision Making / Malcolm J. Beynon .............................................................. 1175

Vertical Data Mining / William Perrizo, Qiang Ding, Qin Ding, and Taufik Abidin .............................................. 1181

Video Data Mining / JungHwan Oh, JeongKyu Lee, and Sae Hwang .................................................................... 1185

Visualization Techniques for Data Mining / Herna L. Viktor and Eric Paquet ...................................................... 1190

Wavelets for Querying Multidimensional Datasets / Cyrus Shahabi, Dimitris Sacharidis,
and Mehrdad Jahangiri ...................................................................................................................................... 1196

Web Mining in Thematic Search Engines / Massimiliano Caramia and Giovanni Felici .................................... 1201

Web Mining Overview / Bamshad Mobasher ......................................................................................................... 1206

Web Page Extension of Data Warehouses / Anthony Scime ................................................................................... 1211

Web Usage Mining / Bamshad Mobasher ............................................................................................................... 1216

Web Usage Mining and Its Applications / Yongjian Fu ......................................................................................... 1221

Web Usage Mining Data Preparation / Bamshad Mobasher ................................................................................... 1226

Web Usage Mining through Associative Models / Paolo Giudici and Paola Cerchiello .................................... 1231

World Wide Web Personalization / Olfa Nasraoui ................................................................................................. 1235

World Wide Web Usage Mining / Wen-Chen Hu, Hung-Jen Yang, Chung-wei Lee, and Jyh-haw Yeh ............... 1242


Foreword

There has been much interest in the data mining field, in both academia and industry, over the past
10-15 years. The number of researchers and practitioners working in the field and the number of scientific papers
published in various data mining outlets increased drastically over this period. Major commercial vendors incorporated
various data mining tools into their products, and numerous applications in many areas, including life sciences, finance,
CRM, and Web-based applications, have been developed and successfully deployed.
Moreover, this interest is no longer limited to the researchers working in the traditional fields of statistics, machine
learning and databases, but has recently expanded to other fields, including operations research/management science
(OR/MS) and mathematics, as evidenced from various data mining tracks organized at different INFORMS meetings,
special issues of OR/MS journals and the recent conference on Mathematical Foundations of Learning Theory
organized by mathematicians.
As the Encyclopedia of Data Warehousing and Mining amply demonstrates, all these diverse interests from
different groups of researchers and practitioners helped to shape data mining as a broad and multi-faceted discipline
spanning a large class of problems in such diverse areas as life sciences, marketing (including CRM and e-commerce),
finance, telecommunications, astronomy, and many other fields (the so-called "data mining and X" phenomenon, where
X constitutes a broad range of fields where data mining is used for analyzing the data). This also resulted in a process
of cross-fertilization of ideas generated by these diverse groups of researchers interacting across the traditional
boundaries of their disciplines.
Despite all this progress, data mining still faces several challenges that make the field ripe with future research
opportunities. First, despite the cross-fertilization of ideas spanning various disciplines, the convergence among
different disciplines proceeds gradually, and more work is required to arrive at a unified view of data mining widely
accepted by different groups of researchers. Second, despite a considerable progress, still more work is required on
the theoretical foundations of data mining, as was recently stated by the participants of the Dagstuhl workshop
"Data Mining: The Next Generation" organized by R. Agrawal, J.-C. Freytag and R. Ramakrishnan and also expressed by
various other data mining researchers. Third, the data mining community must address the privacy and security
problems for data mining to be accepted by privacy advocates and Congress. Fourth, as the field advances, so
does the scope of data mining applications. The challenge to the field is to develop more advanced data mining methods
that would work in these increasingly demanding applications. Fifth, despite a considerable progress in developing more
user-friendly data mining tools, more work is required in this area with the goal of making these tools accessible to a
large audience of naïve data mining users. In particular, one of the challenges is to devise methods that would
smoothly embed data mining tools into corresponding applications on the front-end and would integrate these tools
with databases on the back-end. Achieving such capabilities is very important since this would allow data mining to
"cross the chasm" (using Geoffrey Moore's terminology) and become a mainstream technology utilized by millions of
users. Finally, more work is required on actionability and on the development of better methods for discovering
actionable patterns in the data. Currently, discovering actionable patterns in data constitutes a laborious and
challenging process. It is important to streamline and simplify this process and make it more efficient.
Given significant and rapid advancements in data mining and data warehousing, it is important to take periodic
snapshots of the field every few years. The data mining community addressed this issue by producing publications
covering the state of the art of the field every few years starting with the first volume Advances in Knowledge
Discovery and Data Mining (edited by U. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy) published by
AAAI/MIT Press in 1996. This encyclopedia provides the latest snapshot of the field and surveys a broad array of
topics ranging from the basic theories to the recent advancements in the field and covers a diverse range of problems


from the analysis of microarray data to the analysis of multimedia and Web data. It also identifies future directions and
trends in data mining and data warehousing. Therefore, this volume should become an excellent guide to researchers
and practitioners.

Alexander Tuzhilin
New York University, USA


Preface

How can a data-flooded manager get out of the mire? How can a confused decision maker pass through a maze?
How can an over-burdened problem solver clean up a mess? How can an exhausted scientist decipher a myth?
The answer is an interdisciplinary subject and a powerful tool known as data mining (DM). DM can turn data into
dollars; transform information into intelligence; change pattern into profit; and convert relationship into resources.
As the third branch of operations research and management science (OR/MS) and the third milestone of data
management, DM can help attack the third category of decision making by elevating our raw data into the third stage
of knowledge creation.
The term "third" has been mentioned four times above. Let's go backward and look at the three stages of knowledge
creation. Managers are often drowning in data (the first stage) but starving for knowledge. A collection of data is not
information (the second stage); and a collection of information is not knowledge. Data begets information which begets
knowledge. The whole subject of DM has a synergy of its own and represents more than the sum of its parts.
There are three categories of decision making: structured, semi-structured and unstructured. Decision making
processes fall along a continuum that ranges from highly structured decisions (sometimes called programmed) to highly
unstructured (non-programmed) decisions (Turban et al., 2005, p. 12).
At one end of the spectrum, structured processes are routine and typically repetitive problems for which standard
solutions exist. Unfortunately, rather than being static, deterministic and simple, the majority of real world problems
are dynamic, probabilistic, and complex. Many professional and personal problems are classified as unstructured, or
marginally as semi-structured, or even in between, since the boundaries between them may not be crystal-clear.
In addition to developing normative models (such as linear programming, economic order quantity) for solving
structured (or programmed) problems, operation researchers and management scientists have created many descriptive
models, such as simulation and goal programming, to deal with semi-structured alternatives. Unstructured problems,
however, fall in a gray area for which there are no cut-and-dried solution methods. The current two branches of OR/MS
hit a dead end with unstructured problems.
To gain knowledge, one must understand the patterns that emerge from information. Patterns are not just simple
relationships among data; they exist separately from information, as archetypes or standards to which emerging
information can be compared so that one may draw inferences and take action. Over the last 40 years, the tools and
techniques used to process data and information have continued to evolve from databases (DBs) to data warehousing
(DW) and further to DM. DW applications, the middle of these three stages, have become business-critical. However,
DM can help deliver even more value from these huge repositories of information.
Certainly, there are many statistical models that have emerged over time. Machine learning has marked a milestone
in the evolution of computer science (Fayyad, Piatetsky-Shapiro, Smyth & Uthurusamy, 1996). Although DM is still
in its infancy, it is now being used in a wide range of industries and for a range of tasks in a variety of contexts (Wang,
2003). DM is synonymous with knowledge discovery in databases, knowledge extraction, data/pattern analysis, data
archeology, data dredging, data snooping, data fishing, information harvesting, and business intelligence (Giudici,
2003; Hand et al., 2001; Han & Kamber, 2000). There are unprecedented opportunities in the future to utilize DM.
Data warehousing and mining (DWM) is the science of managing and analyzing large datasets and discovering novel
patterns. In recent years, DWM has emerged as a particularly exciting and industrially relevant area of research.
Prodigious amounts of data are now being generated in domains as diverse and elusive as market research, functional
genomics and pharmaceuticals. Intelligently analyzing data to discover knowledge with the aim of answering crucial
questions and helping make informed decisions is the challenge that lies ahead.
The Encyclopedia of Data Warehousing and Mining provides theories, methodologies, functionalities, and
applications to decision makers, problem solvers, and data miners in business, academia, and government. DWM lies


at the junction of database systems, artificial intelligence, machine learning and applied statistics, which makes it a
valuable area for researchers and practitioners. With a comprehensive overview, The Encyclopedia of Data Warehous-
ing and Mining offers a thorough exposure to the issues of importance in this rapidly changing field. The encyclopedia
also includes a rich mix of introductory and advanced topics while providing a comprehensive source of technical,
functional and legal references to DWM.
After spending more than a year preparing this book, with a strictly peer-reviewed process, I am delighted to see
it published. The standard for selection was very high. Each article went through at least three peer reviews; additional
third-party reviews were sought in cases of controversy. There have been innumerable instances where this feedback
has helped to improve the quality of the content, and even influenced authors in how they approach their topics.
The primary objective of this encyclopedia is to explore the myriad of issues regarding DWM. A broad spectrum
of practitioners, managers, scientists, educators, and graduate students who teach, perform research, and/or implement
these discoveries, are the envisioned readers of this encyclopedia.
The encyclopedia contains a collection of 234 articles, written by an international team of 361 experts representing
leading scientists and talented young scholars from 34 countries. They have contributed great effort to create a source
of solid, practical information, informed by sound underlying theory that should become a resource for all people
involved in this dynamic new field. Let's take a peek at a few articles:
The evaluation of DM methods requires a great deal of attention. A valid model evaluation and comparison can
improve considerably the efficiency of a DM process. Paolo Giudici has presented several ways to perform model
comparison, in which each has its advantages and disadvantages.
According to Zbigniew W. Ras, the main objective of action rules is to generate special types of rules for a database
that point the direction for re-classifying objects with respect to some distinguishing attributes (called decision
attributes). This creates flexible attributes that form a basis for action rules construction.
With the constraints imposed by computer memory and mining algorithms, we can experience selection pressures
more than ever. The main point of instance selection is approximation. Our task is to achieve as good mining results
as possible by approximating the whole dataset with the selected instances and hope to do better in DM with instance
selection as it is possible to remove noisy and irrelevant data in the process. Huan Liu and Lei Yu have presented an
initial attempt to review and categorize the methods of instance selection in terms of sampling, classification, and
clustering.
Shichao Zhang and Chengqi Zhang introduce a group of pattern discovery systems for dealing with the multiple
data source (MDS) problem, mainly including a logical system for enhancing data quality; a logical system for resolving
conflicts; a data cleaning system; a database clustering system; a pattern discovery system and a post-mining system.
Based on his extensive experience, Gautam Das surveys recent state-of-the-art solutions to the problem of
approximate query answering in databases, in which "ballpark answers" (i.e., approximate answers) to queries can be
provided within acceptable time limits. These techniques sacrifice accuracy to improve running time; typically through
some sort of lossy data compression. Also, Han-Joon Kim (the holder of two patents on text mining applications)
discusses a comprehensive text-mining solution to document indexing problems on topic hierarchies (taxonomy).
Condensed representations have been proposed as a useful concept for the optimization of typical DM tasks. It
appears as a key concept within the emerging inductive DB framework, where inductive query evaluation calls for
effective constraint-based DM techniques. Jean-François Boulicaut introduces this research domain, its achievements
in the context of frequent itemset mining from transactional data and its future trends.
Zhi-Hua Zhou discusses complexity issues in DM. Although we still have a long way to go in order to produce
patterns that can be understood by most people involved with DM tasks, endeavors on improving the comprehensibility
of complicated algorithms have proceeded at a promising pace.
Pattern classification poses a difficult challenge in finite settings and high dimensional spaces caused by the issue
of dimensionality. Carlotta Domeniconi and Dimitrios Gunopulos discuss classification techniques, including the
authors' own work, to mitigate the problem of dimensionality and reduce bias, by estimating local feature relevance and
selecting features accordingly. This issue has both theoretical and practical relevance, since learning tasks abound in
which data are represented as a collection of a very large numbers of features. Thus, many applications can benefit from
improvements in predicting error.
Qinghua Zou proposes using pattern decomposition algorithms to find frequent patterns in large datasets. Pattern
decomposition is a DM technology that uses known, frequent or infrequent patterns to decompose long itemsets to
many short ones. It identifies frequent patterns in a dataset using a bottom-up methodology and reduces the size of
the dataset in each step. The algorithm avoids the process of candidate set generation and decreases the time for
counting supports due to the reduced dataset.


Perrizo, Ding, et al. review a category of DM approaches using vertical data structures. They demonstrate their
applications in various DM areas, such as association rule mining and multi-relational DM. Vertical DM strategy aims
at addressing scalability issues by organizing data in vertical layouts and conducting logical operations on vertically
partitioned data instead of scanning the entire DB horizontally.
Integration of data sources refers to the task of developing a common schema, as well as data transformation
solutions, for a number of data sources with related content. The large number and size of modern data sources makes
manual approaches to integration increasingly impractical. Andreas Koeller provides a comprehensive overview over
DM techniques which can help to partially or fully automate the data integration process.
DM applications often involve testing hypotheses regarding thousands or millions of objects at once. The statistical
concept of multiple hypothesis testing is of great practical importance in such situations, and an appreciation of the
issues involved can vastly reduce errors and associated costs. Sach Mukherjee provides an introductory look at multiple
hypothesis testing in the context of DM.
Maria Vardaki illustrates the benefits of using statistical metadata by information systems, depicting also how such
standardization can improve the quality of statistical results. She proposes a common, semantically rich, and object-
oriented data/metadata model for metadata management that integrates the main steps of data processing and covers
all aspects of DW that are essential for DM requirements. Finally, she demonstrates how a metadata model can be
integrated in a web-enabled statistical information system to ensure quality of statistical results.
A major obstacle in DM applications is the gap between statistic-based pattern extraction and value-based decision-
making. Profit mining aims at reducing this gap. The concept and techniques proposed by Ke Wang and Senqiang Zhou
are applicable to applications under a general notion of utility.
Although a tremendous amount of progress has been made in DM over the last decade or so, many important
challenges still remain. For instance, there are still no solid standards of practice; it is still too easy to misuse DM
software; secondary data analysis without appropriate experimental design is still common; and it is still hard to choose
the right kind of analysis method for the problem at hand. Xiao Hui Liu points out that intelligent data analysis (IDA)
is an interdisciplinary study concerning the effective analysis of data, which may help advance the state of the art in the
field.
In recent years, the need to extract complex tree-like or graph-like patterns in massive data collections (e.g., in
bioinformatics, semistructured or Web DBs) has become a necessity. This has led to the emergence of the research field
of graph and tree mining. This field provides many promising topics for both theoretical and engineering achievements,
and many expect this to be one of the key fields in DM research in the years ahead. Katsaros and Manolopoulos review
the most important strategic application-domains where frequent structure mining (FSM) provides significant results.
A survey is presented of the most important algorithms that have been proposed for mining graph-like and tree-like
substructures in massive data collections.
Lawrence B. Holder and Diane J. Cook are among the pioneers in the field of graph-based DM and have developed
the widely-disseminated Subdue graph-based DM system (http://ailab.uta.edu/subdue). They have directed multi-
million dollar government-funded projects in the research, development and application of graph-based DM in real-
world tasks ranging from bioinformatics to homeland security.
Graphical models such as Bayesian networks (BNs) and decomposable Markov networks (DMNs) have been widely
applied to probabilistic reasoning in intelligent systems. Automatic discovery of such models from data is desirable,
but is NP-hard in general. Common learning algorithms use single-link look-ahead searches for efficiency. However,
pseudo-independent (PI) probabilistic domains are not learnable by such algorithms. Yang Xiang introduces funda-
mentals of PI domains and explains why common algorithms fail to discover them. He further offers key ideas as to how
they can efficiently be discovered, and predicts advances in the near future.
Semantic DM is a novel research area that uses graph-based DM techniques and ontologies to identify complex
patterns in large, heterogeneous data sets. Tony Hu's research group at Drexel University is involved in the
development and application of semantic DM techniques to the bioinformatics and homeland security domains.
Yu-Jin Zhang presents a novel method for image classification based on feature element through association rule
mining. The feature elements can capture well the visual meanings of images according to the subjective perception
of human beings, and are suitable for working with rule-based classification models. Techniques are adapted for mining
the association rules which can find associations between the feature elements and class attributes of the image, and
the mined rules are applied to image classifications.
Results of image DB queries are usually presented as a thumbnail list. Subsequently, each of these images can be
used for refinement of the initial query. This approach is not suitable for queries by sketch. In order to receive the desired
images, the user has to recognize misleading areas of the sketch and modify these images appropriately. This is a non-


trivial problem, as the retrieval often is based on complex, non-intuitive features. Therefore, Odej Kao presents a mosaic-
based technique for sketch feedback, which combines the best sections contained in an image DB into a single query
image.
Andrew Kusiak and Shital C. Shah emphasize the need for an individual-based paradigm, which may ensure the well-
being of patients and the success of pharmaceutical industry. The new methodologies are illustrated with various
medical informatics research projects on topics such as predictions for dialysis patients, significant gene/SNP
identifications, hypoplastic left heart syndrome for infants, and epidemiological and clinical toxicology. DWM and data
modeling will ultimately lead to targeted drug discovery and individualized treatments with minimum adverse effects.
The use of microarray DBs has revolutionized the way in which biomedical research and clinical investigation can
be conducted in that high-density arrays of specified DNA sequences can be fabricated onto a single glass slide or
chip. However, the analysis and interpretation of the vast amount of complex data produced by this technology poses
an unprecedented challenge. LinMin Fu and Richard Segall present a state-of-the-art review of microarray DM problems
and solutions.
Knowledge discovery from genomic data has become an important research area for biologists. An important
characteristic of genomic applications is the very large amount of data to be analyzed, and most of the time, it is not
possible to apply only classical statistical methods. Therefore, Jourdan, Dhaenens and Talbi propose to model
knowledge discovery tasks associated with such problems as combinatorial optimization tasks, in order to apply
efficient optimization algorithms to extract knowledge from those large datasets.
Founded on Indrani Chakravarty et al.'s research, the handwritten signature is a behavioral biometric. There
are two methods used for recognition of handwritten signatures: offline and online. While offline methods extract static
features of signature instances by treating them as images, online methods extract and use temporal or dynamic features
of signatures for recognition purposes. Temporal features are difficult to imitate, and hence online recognition methods
offer higher accuracy rates than offline methods.
Neurons are small processing units that are able to store some information. When several neurons are connected,
the result is a neural network, a model inspired by biological neural networks like the brain. Kate Smith provides useful
guidelines to ensure successful learning and generalization of the neural network model. Also, a special version in the
form of probabilistic neural networks (PNNs) is explained by Ingrid Fischer with the help of graphic transformations.
The sheer volume of multimedia data available has exploded on the Internet in the past decade in the form of webcasts,
broadcast programs and streaming audio and video. Automated content analysis tools for multimedia depend on face
detectors and recognizers; videotext extractors; speech and speaker identifiers; people/vehicle trackers; and event
locators resulting in large sets of multimodal features that can be real-valued, discrete, ordinal, or nominal. Multimedia
metadata based on such a multimodal collection of features, poses significant difficulties to subsequent tasks such as
classification, clustering, visualization and dimensionality reduction which traditionally deal only with continuous-
valued data. Aradhye and Dorai discuss mechanisms that extend tasks traditionally limited to continuous-valued feature
spaces to multimodal multimedia domains with symbolic and continuous-valued features, including (a) dimensionality
reduction, (b) de-noising, (c) visualization, and (d) clustering.
Brian C. Lovell and Shaokang Chen review the recent advances in the application of face recognition for multimedia
DM. While the technology for mining text documents in large DBs could be said to be relatively mature, the same cannot
be said for mining other important data types such as speech, music, images and video. Yet these forms of multimedia
data are becoming increasingly common on the Internet and intranets.
The goal of Web usage mining is to capture, model, and analyze the behavioral patterns and profiles of users
interacting with a Web site. Bamshad Mobasher and Yongjian Fu provide an overview of the three primary phases of
the Web mining process: data preprocessing, pattern discovery, and pattern analysis. The primary focus of their articles
is on the types of DM and analysis tasks most commonly used in Web usage mining, as well as some of their typical
applications in areas such as Web personalization and Web analytics. Ji-Rong Wen explores the ways of enhancing
Web search using query log mining and Web structure mining.
In line with Mike Thelwall's opinion, scientific Web intelligence (SWI) is a research field that combines techniques
from DM, Web intelligence and scientometrics to extract useful information from the links and text of academic-related
Web pages, using various clustering, visualization and counting techniques. SWI is a type of Web mining that combines
Web structure mining and text mining. Its main uses are in addressing research questions concerning the Web, or Web-
related phenomena, rather than in producing commercially useful knowledge.
Web-enabled electronic business is generating massive amount of data on customer purchases, browsing patterns,
usage times and preferences at an increasing rate. DM techniques can be applied to all the data being collected. Richi
Nayak presents issues associated with DM for Web-enabled electronic-business.


Tobias Scheffer gives an overview of common email mining tasks including email filing, spam filtering and mining
communication networks. The main section of his work focuses on recent developments in mining email data for support
of the message creation process. Approaches to mining question-answer pairs and sentences are also reviewed.
Stanley Loh describes a computer-supported approach to mine discussions that occurred in chat rooms. Dennis
Mcleod explores incremental mining from news streams. JungHwan Oh summarizes the current status of video DM. J.
Ben Schafer addresses the technology used to generate recommendations.
In the abstract, a DW can be seen as a set of materialized views defined over source relations. During the initial design
of a DW, the designer faces the problem of deciding which views to materialize in the DW. This problem has been
addressed in the literature for different classes of queries and views, and with different design goals. Theodoratos and
Simitsis identify the different design goals used to formulate alternative versions of the problem and highlight the
techniques used to solve it.
Michel Schneider addresses the problem of designing a DW schema. He suggested a general model for this purpose
that integrates a majority of existing models: the notion of a well-formed structure is proposed to help design the process;
a graphic representation is suggested for drawing well-formed structures; and the classical star-snowflake structure
is represented.
Anthony Scime presents a methodology for adding external information from the World Wide Web to a DW, in
addition to the DWs domain information. The methodology assures decision makers that the added Web based data
are relevant to the purpose and current data of the DW.
Privacy and confidentiality of individuals are important issues in the information technology age. Advances in DM
technology have increased privacy concerns even more. Jack Cook and Yücel Saygın highlight the privacy and
confidentiality issues in DM, and survey state-of-the-art solutions and approaches for achieving privacy preserving
DM.
Ken Goodman provides one of the first overviews of ethical issues that arise in DM. He shows that while privacy
and confidentiality often are paramount in discussions of DM, other issues including the characterization of
appropriate uses and users, and data miners' intentions and goals must be considered. Machine learning in genomics
and in security surveillance are set aside as special issues requiring attention.
Increased concern about privacy and information security has led to the development of privacy preserving DM
techniques. Yehuda Lindell focuses on the paradigms for defining security in this setting, and the need for a rigorous
approach. Shamik Sural et al. present some of the important approaches to privacy protection in association rule mining.
Human-computer interaction is crucial in the knowledge discovery process in order to accomplish a variety of novel
goals of DM. In Shou Hong Wang's opinion, interactive visual DM is human-centered DM, implemented through
knowledge discovery loops coupled with human-computer interaction and visual representations.
Symbiotic DM is an evolutionary approach that shows how organizations analyze, interpret, and create new
knowledge from large pools of data. Symbiotic data miners are trained business and technical professionals skilled in
applying complex DM techniques and business intelligence tools to challenges in a dynamic business environment.
Athappilly and Rea opened the discussion on how businesses and academia can work to help professionals learn, and
fuse the skills of business, IT, statistics, and logic to create the next generation of data miners.
Yiyu Yao and Yan Zhao first make an immediate comparison between scientific research and DM and add an
explanation construction and evaluation task to the existing DM framework. Explanation-oriented DM offers a new
perspective, which has a significant impact on the understanding of the complete process of DM and effective
applications of DM results.
Traditional DM views the output from any DM initiative as a homogeneous knowledge product. Knowledge
however, always is a multifaceted construct, exhibiting many manifestations and forms. It is the thesis of Nilmini
Wickramasinghe's discussion that a more complete and macro perspective, and a more balanced approach to knowledge
creation, can best be provided by taking a broader perspective of the knowledge product resulting from the KDD
process: namely, by incorporating a people-based perspective into the traditional KDD process, and viewing knowledge
as the multifaceted construct it is. This in turn will serve to enhance the knowledge base of an organization, and facilitate
the realization of effective knowledge.
Fabrice Muhlenbach and Ricco Rakotomalala are the authors of an original supervised multivariate discretization
method called HyperCluster Finder. Their major contributions to the research community are included in the DM software
called TANAGRA, which is freely available on the Internet.
Recently there have been many efforts to apply DM techniques to security problems, including homeland security
and cyber security. Bhavani Thuraisingham (the inventor of three patents for MITRE) examines some of these
developments in DM in general and link analysis in particular, and shows how DM and link analysis techniques may
be applied for homeland security applications. Some emerging trends are also discussed.


In order to reduce financial statement errors and fraud, Garrity, O'Donnell and Sanders proposed an architecture
that provides auditors with a framework for an effective continuous auditing environment that utilizes DM.
The applications of DWM are everywhere: from Kernel Methods in Chemoinformatics to Data Mining for Damage
Detection in Engineering Structures; from Predicting Resource Usage for Capital Efficient Marketing to Mining for
Profitable Patterns in the Stock Market; from Financial Ratio Selection for Distress Classification to Material
Acquisitions Using Discovery Informatics Approach; from Resource Allocation in Wireless Networks to Reinforcing
CRM with Data Mining; from Data Mining Medical Digital Libraries to Immersive Image Mining in Cardiology; from
Data Mining in Diabetes Diagnosis and Detection to Distributed Data Management of Daily Car Pooling Problems;
and from Mining Images for Structure to Automatic Musical Instrument Sound Classification. The list of DWM
applications is endless and the future of DWM is promising.
Knowledge explosion pushes DWM, a multidisciplinary subject, to ever-expanding regions. Inclusion, omission,
emphasis, evolution and even revolution are part of our professional life. In spite of our efforts to be careful, should
you find any ambiguities or perceived inaccuracies, please contact me at wangj@mail.montclair.edu.

REFERENCES

Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1996). Advances in knowledge discovery and data
mining. AAAI/MIT Press.
Giudici, P. (2003). Applied data mining: Statistical methods for business and industry. John Wiley.
Han, J., & Kamber, M. (2000). Data mining: Concepts and techniques. Morgan Kaufmann Publishers.
Hand, D., Mannila, H., & Smyth, P. (2001). Principles of data mining. MIT Press.
Turban, E., Aronson, J. E., & Liang, T. P. (2005). Decision support systems and intelligent systems. Upper Saddle River,
NJ: Pearson Prentice Hall.
Wang, J. (2003). Data mining: Opportunities and challenges. Hershey, PA: Idea Group Publishing.


Acknowledgments

The editor would like to thank all of the authors for their insights and excellent contributions to this book. I also want
to thank the group of anonymous reviewers who assisted me in the peer-reviewing process and provided comprehen-
sive, critical, and constructive reviews. Each Editorial Advisory Board member has made a big contribution in guidance
and assistance.
The editor wishes to acknowledge the help of all involved in the development process of this book, without whose
support the project could not have been satisfactorily completed. Linxi Liao and MinSun Ku, two graduate assistants,
are hereby graciously acknowledged for their diligent work. I owe my thanks to Karen Dennis for lending a hand in the
tedious process of proof-reading. A further special note of thanks goes to the staff at Idea Group Inc., whose
contributions have been invaluable throughout the entire process, from inception to final publication. Particular thanks
go to Sara Reed, Jan Travers, and Rene Davies, who continuously prodded via e-mail to keep the project on schedule,
and to Mehdi Khosrow-Pour, whose enthusiasm motivated me to accept his invitation to join this project.
My appreciation is also due the Global Education Center at MSU for awarding me a Global Education Fund. I would
also like to extend my thanks to my brothers Zhengxian, Shubert (an artist, http://www.portraitartist.com/wang/bio.htm),
and sister Jixian, who stood solidly behind me and contributed in their own sweet little ways. Thanks go to all Americans,
since it would not have been possible for the four of us to come to the U.S. without their support of different scholarships.
Finally, I want to thank my family: my parents, Houde Wang and Junyan Bai for their encouragement; my wife Hongyu
for her unfailing support, and my son Leigh for being without a dad during this project. By the way, our second boy
Leon was born on August 4, 2004. Like a baby, DWM has a bright and promising future.

John Wang, Ph.D.


Wangj@mail.montclair.edu
Dept. Information and Decision Sciences
School of Business
Montclair State University
Montclair, New Jersey
USA


About the Editor

John Wang, Ph.D., is a full professor in the Department of Information and Decision Sciences at Montclair State
University (MSU), USA. Professor Wang has published 89 refereed papers and three books. He is on the editorial board
of the International Journal of Cases on Electronic Commerce and has been a guest editor and referee for Operations
Research, IEEE Transactions on Control Systems Technology, and many other highly prestigious journals. His long-
term research goal is on the synergy of operations research, data mining and cybernetics.


Action Rules
Zbigniew W. Ras
University of North Carolina, Charlotte, USA

Angelina Tzacheva
University of North Carolina, Charlotte, USA

Li-Shiang Tsay
University of North Carolina, Charlotte, USA

INTRODUCTION

There are two aspects of interestingness of rules that have been studied in the data mining literature: objective and subjective measures (Liu, 1997; Adomavicius & Tuzhilin, 1997; Silberschatz & Tuzhilin, 1995, 1996). Objective measures are data-driven and domain-independent. Generally, they evaluate the rules based on their quality and the similarity between them. Subjective measures, including unexpectedness, novelty and actionability, are user-driven and domain-dependent.

A rule is actionable if a user can do an action to his/her advantage based on this rule (Liu, 1997). This definition, in spite of its importance, is too vague and leaves the door open to a number of different interpretations of actionability. In order to narrow it down, a new class of rules (called action rules), constructed from certain pairs of association rules, was proposed in Ras & Wieczorkowska (2000). A formal definition of an action rule was independently proposed in Geffner & Wainer (1998). These rules have been investigated further in Tsay & Ras (2005) and Tzacheva & Ras (2005).

To give an example justifying the need for action rules, let us assume that a number of customers have closed their accounts at one of the banks. We construct, possibly the simplest, description of that group of people and next search for a new description, similar to the one we have, with a goal to identify a new group of customers from which no one left that bank. If these descriptions have the form of rules, then they can be seen as actionable rules. Now, by comparing these two descriptions, we may find the cause why these accounts have been closed and formulate an action which, if undertaken by the bank, may prevent other customers from closing their accounts. Such actions are stimulated by action rules and they are seen as precise hints for the actionability of rules. For example, an action rule may say that by inviting people from a certain group of customers for a glass of wine by the bank, it is guaranteed that these customers will not close their accounts and will not move to another bank. Sending invitations by regular mail to all these customers, or inviting them personally by giving them a call, are examples of an action associated with that action rule.

In a paper by Ras & Gupta (2002), the authors assume that the information system is distributed and its sites are autonomous. They show that it is wise to search for action rules at remote sites when action rules extracted at the client site cannot be implemented in practice (suggested actions are too expensive or too risky). The composition of two action rules, not necessarily extracted at the same site, was defined in Ras & Gupta (2002). The authors gave assumptions guaranteeing the correctness of such a composition. One of these assumptions requires that the semantics of attributes, including the interpretation of null values, be the same at both sites. This assumption is relaxed in Tzacheva & Ras (2005), since the authors allow different granularities of the same attribute at the involved sites. In the same paper, they introduce the notion of the cost and feasibility of an action rule. Usually, a number of action rules or chains of action rules can be applied to reclassify a certain set of objects. The cost associated with changes of values within one attribute is usually different than the cost associated with changes of values within another attribute. A strategy for replacing the initially extracted action rule by a composition of new action rules, dynamically built, was proposed by Tzacheva & Ras (2005). This composition of rules uniquely defines a new action rule, and it was built with the goal of lowering the cost of reclassifying the objects supported by the initial action rule.

BACKGROUND

In the paper by Ras & Wieczorkowska (2000), the notion of an action rule was introduced. The main idea was to generate, from a database, a special type of rules which basically form a hint to users showing a way to
reclassify objects with respect to some distinguished attribute (called a decision attribute). Clearly, each relational schema gives a list of attributes used to represent objects stored in a database. Values of some of these attributes, for a given object, can be changed, and this change can be influenced and controlled by the user. However, some of these changes (for instance, profit) cannot be made directly to a decision attribute. In such a case, definitions of this decision attribute in terms of other attributes (called classification attributes) have to be learned. These new definitions are used to construct action rules showing what changes in values of some attributes, for a given class of objects, are needed in order to reclassify objects the way users want. But users may still be either unable or unwilling to proceed with actions leading to such changes. In all such cases, we may search for definitions of values of any classification attribute listed in an action rule. By replacing a value of such an attribute by its definition, extracted either locally or at remote sites (if the system is distributed), we construct new action rules, which might be of more interest to business users than the initial rule.

MAIN THRUST

The technology dimension will be explored to clarify the meaning of actionable rules, including action rules and extended action rules.

Action Rules Discovery in a Stand-alone Information System

An information system is used for representing all knowledge. Its definition, given here, is due to Pawlak (1991).

By an information system we mean a pair S = (U, A), where:

1. U is a nonempty, finite set of objects (object identifiers),
2. A is a nonempty, finite set of attributes, that is, a: U → Va for a ∈ A, where Va is called the domain of a.

Information systems can be seen as decision tables. In any decision table, together with the set of attributes, a partition of that set into conditions and decisions is given. Additionally, we assume that the set of conditions is partitioned into stable and flexible conditions (Ras & Wieczorkowska, 2000).

An attribute a ∈ A is called stable for the set U if its values assigned to objects from U cannot change in time. Otherwise, it is called flexible. Date of birth is an example of a stable attribute. The interest rate on any customer account is an example of a flexible attribute. For simplicity reasons, we will consider decision tables with only one decision. We adopt the following definition of a decision table:

By a decision table we mean an information system S = (U, A1 ∪ A2 ∪ {d}), where d ∉ A1 ∪ A2 is a distinguished attribute called the decision. The elements of A1 are called stable conditions, whereas the elements of A2 ∪ {d} are called flexible conditions. Our goal is to change values of attributes in A2 for some objects from U so that the values of the attribute d for these objects may change as well. Certain relationships between attributes from A1 ∪ A2 and the attribute d will have to be discovered first.

By Dom(r) we mean all attributes listed in the IF part of a rule r extracted from S. For example, if r = [(a1,3)*(a2,4) → (d,3)] is a rule, then Dom(r) = {a1, a2}. By d(r) we denote the decision value of rule r. In our example, d(r) = 3.

If r1, r2 are rules and B ⊆ A1 ∪ A2 is a set of attributes, then r1/B = r2/B means that the conditional parts of rules r1, r2 restricted to attributes B are the same. For example, if r1 = [(a1,3) → (d,3)], then r1/{a1} = r/{a1}.

Assume also that (a, v → w) denotes the fact that the value of attribute a has been changed from v to w. Similarly, the term (a, v → w)(x) means that a(x) = v has been changed to a(x) = w. In other words, the property (a,v) of an object x has been changed to the property (a,w).

Assume now that rules r1, r2 have been extracted from S and r1/A1 = r2/A1, d(r1) = k1, d(r2) = k2, and k1 < k2. Also, assume that (b1, b2, …, bp) is a list of all attributes in Dom(r1) ∩ Dom(r2) ∩ A2 on which r1, r2 differ, and that r1(b1) = v1, r1(b2) = v2, …, r1(bp) = vp, and r2(b1) = w1, r2(b2) = w2, …, r2(bp) = wp.

By the (r1,r2)-action rule on x ∈ U we mean the statement:

[(b1, v1 → w1) ∧ (b2, v2 → w2) ∧ … ∧ (bp, vp → wp)](x) ⇒ [(d, k1 → k2)](x).

If the value of the rule on x is true, then the rule is valid. Otherwise, it is false.

Let us denote by U<r1> the set of all customers in U supporting the rule r1. If the (r1,r2)-action rule is valid on x ∈ U<r1>, then we say that the action rule supports the new profit ranking k2 for x.

To define an extended action rule (Tsay & Ras, 2005), let us assume that two rules are considered. We present them in Table 1 to better clarify the process of constructing extended action rules. Here, St means a stable classification attribute and Fl a flexible one.

Table 1.

A (St)  B (Fl)  C (St)  E (Fl)  G (St)  H (Fl)  D (Decision)
a1      b1      c1      e1                      d1
a1      b2                      g2      h2      d2
In a classical representation, these two rules have the form:

r1 = [a1 * b1 * c1 * e1 → d1],
r2 = [a1 * b2 * g2 * h2 → d2].

Assume now that object x supports rule r1, which means that it is classified as d1. In order to reclassify x to class d2, we need to change its value of B from b1 to b2, but we also have to require that G(x) = g2 and that the value of H for object x be changed to h2. This is the meaning of the extended (r1,r2)-action rule given below:

[(B, b1 → b2) ∧ (G = g2) ∧ (H, → h2)](x) ⇒ (D, d1 → d2)(x).

Assume now that by Sup(t) we mean the number of tuples having property t. By the support of the extended (r1,r2)-action rule (given above) we mean:

Sup[(A=a1)*(B=b1)*(G=g2)].

By the confidence of the extended (r1,r2)-action rule (given above) we mean:

[Sup[(A=a1)*(B=b1)*(G=g2)*(D=d1)] / Sup[(A=a1)*(B=b1)*(G=g2)]] × [Sup[(A=a1)*(B=b2)*(C=c1)*(D=d2)] / Sup[(A=a1)*(B=b2)*(C=c1)]].

To give another example of an extended action rule, assume that S = (U, A1 ∪ A2 ∪ {d}) is a decision table represented by Table 2, and that A1 = {c, b}, A2 = {a}.

Table 2.

     c   a   b   d
x1   2   1   1   L
x2   1   2   2   L
x3   2   2   1   H
x4   1   1   1   L

For instance, the rules r1 = [(a,1)*(b,1) → (d,L)] and r2 = [(c,2)*(a,2) → (d,H)] can be extracted from S, where U<r1> = {x1, x4}. The extended (r1,r2)-action rule

[(a, 1 → 2) ∧ (c = 2)](x) ⇒ [(d, L → H)](x)

is supported only by object x1. The corresponding (r1,r2)-action rule

[(a, 1 → 2)](x) ⇒ [(d, L → H)](x)

is supported by x1 and x4.

The confidence of an extended action rule is higher than the confidence of the corresponding action rule, because all objects making the confidence of that action rule lower have been removed from its set of support.
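To make the construction above concrete, here is a minimal Python sketch (an illustrative reconstruction, not code from the article) that encodes Table 2 and computes the supporting objects for rule r1 and for the extended (r1,r2)-action rule; the attribute names and values are exactly those of Table 2, while the function names are our own.

# Decision table S of Table 2; c and b are stable attributes, a is flexible, d is the decision.
table = {
    "x1": {"c": 2, "a": 1, "b": 1, "d": "L"},
    "x2": {"c": 1, "a": 2, "b": 2, "d": "L"},
    "x3": {"c": 2, "a": 2, "b": 1, "d": "H"},
    "x4": {"c": 1, "a": 1, "b": 1, "d": "L"},
}

def support(condition):
    """Objects whose attribute values match every (attribute, value) pair in the condition."""
    return {x for x, row in table.items()
            if all(row[attr] == val for attr, val in condition.items())}

# r1 = (a,1)*(b,1) -> (d,L) and r2 = (c,2)*(a,2) -> (d,H).
U_r1 = support({"a": 1, "b": 1})
print(sorted(U_r1))            # ['x1', 'x4']  -- the set U<r1> from the text

# The (r1,r2)-action rule [(a, 1 -> 2)](x) => [(d, L -> H)](x) is supported by all of U<r1>.
# The extended rule additionally requires the stable condition c = 2 taken from r2,
# so its support shrinks to the objects of U<r1> that already satisfy it:
extended = {x for x in U_r1 if table[x]["c"] == 2}
print(sorted(extended))        # ['x1']

Shrinking the support to objects that already satisfy the stable conditions of r2 is precisely what makes the confidence of the extended rule at least as high as that of the plain action rule, as noted above.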
Action Rules Discovery in a Distributed Autonomous Information System

In Ras & Dardzinska (2002), the notion of a Distributed Autonomous Knowledge System (DAKS) framework was introduced. DAKS is seen as a collection of knowledge systems, where each knowledge system is initially defined as an information system coupled with a set of rules (called a knowledge base) extracted from that system. These rules are transferred between sites due to the requests of a query answering system associated with the client site. Each rule transferred from one site of DAKS to another remains at both sites.

Assume now that information system S represents one of the DAKS sites. If rules extracted from S = (U, A1 ∪ A2 ∪ {d}), describing values of attribute d in terms of attributes from A1 ∪ A2, do not lead to any useful action rules (the user is not willing to undertake any actions suggested by the rules), we may:

1) search for definitions of flexible attributes listed in the classification parts of these rules in terms of other local flexible attributes (local mining for rules),
2) search for definitions of flexible attributes listed in the classification parts of these rules in terms of flexible attributes from another site (mining for rules at remote sites),
3) search for definitions of decision attributes of these rules in terms of flexible attributes from another site (mining for rules at remote sites).

Another problem which has to be taken into consideration is the semantics of attributes that are common to a client site and some of the remote sites. This semantics may easily differ from site to site. Sometimes, such a difference in semantics can be repaired quite easily. For instance, if Temperature in Celsius is used at one site and Temperature in Fahrenheit at the other, a simple mapping will fix the problem. If information systems are complete and two attributes have the same name and differ only in their granularity level, a new hierarchical attribute can be formed to fix the problem. If databases are incomplete, the problem is more
complex because of the number of options available to interpret incomplete values (including null values). The problem is especially difficult in a distributed framework, when chase techniques based on rules extracted at the client and at remote sites are used by a client site to impute current values by values which are less incomplete. These problems are presented, and partial solutions given, in Ras & Dardzinska (2002).

Now, let us assume that the action rule

r = [(b1, v1 → w1) ∧ (b2, v2 → w2) ∧ … ∧ (bp, vp → wp)](x) ⇒ (d, k1 → k2)(x),

extracted from system S, does not provide any useful hint to a user for its actionability. In this case we may look for a new action rule (extracted either from S or from some of its remote sites)

r1 = [(bj1, vj1 → wj1) ∧ (bj2, vj2 → wj2) ∧ … ∧ (bjq, vjq → wjq)](y) ⇒ (bj, vj → wj)(y)

which, concatenated with r, may provide a better hint for its actionability. For simplicity reasons, we assume that the semantics and the granularity levels of all attributes listed in both information systems are the same.

The concatenation [r1 ∘ r] is a new action rule (called global), defined as:

[(b1, v1 → w1) ∧ [(bj1, vj1 → wj1) ∧ (bj2, vj2 → wj2) ∧ … ∧ (bjq, vjq → wjq)] ∧ … ∧ (bp, vp → wp)](x) ⇒ (d, k1 → k2)(x)

where x is an object in S = (X, A, V). Some of the attributes in {bj1, bj2, …, bjq} may not belong to A. Also, the support of r1 is calculated in the information system from which r1 was extracted. Let us denote that system by Sm = (Xm, Am, Vm) and the set of objects in Xm supporting r1 by SupSm(r1). Assume that SupS(r) is the set of objects in S supporting rule r. The domain of [r1 ∘ r] is the same as the domain of r, which is equal to SupS(r).

Before we define the notion of a similarity between two objects belonging to two different information systems, we assume that A = {b1, b2, b3, b4}, Am = {b1, b2, b3, b5, b6}, and that the objects x ∈ X, y ∈ Xm are defined by Table 3 given below. The similarity ρ(x, y) between x and y is then [1 + 0 + 0 + 1/2 + 1/2 + 1/2] / 6 = 5/12.

Table 3.

    b1   b2   b3   b4   b5   b6
x   v1   v2   v3   v4
y   v1   w2   w3        w5   w6

To give a more formal definition of similarity, we assume that:

ρ(x, y) = [Σ{ρ(bi(x), bi(y)) : bi ∈ (A ∪ Am)}] / card(A ∪ Am), where:

ρ(bi(x), bi(y)) = 0, if bi(x) ≠ bi(y);
ρ(bi(x), bi(y)) = 1, if bi(x) = bi(y);
ρ(bi(x), bi(y)) = 1/2, if either bi(x) or bi(y) is undefined.

Also, assume that

ρ(x, SupSm(r1)) = max{ρ(x, y) : y ∈ SupSm(r1)}, for each x ∈ SupS(r).

By the confidence of the action rule [r1 ∘ r] we mean

[Σ{ρ(x, SupSm(r1)) : x ∈ SupS(r)} / card(SupS(r))] × Conf(r1) × Conf(r),

where Conf(r) is the confidence of the rule r in S and Conf(r1) is the confidence of the rule r1 in Sm.

If we allow the concatenation of action rules extracted from S with action rules extracted at other sites of DAKS, we increase the total number of generated action rules, and at the same time our chance of finding more suitable action rules for reclassifying objects in S is also increased.
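The similarity measure and the worked 5/12 example translate directly into a few lines of Python (again an illustrative reconstruction under the article's definitions, with None standing for an undefined value):

# Objects x and y over A ∪ Am = {b1, ..., b6}, exactly as in Table 3 (None = undefined).
x = {"b1": "v1", "b2": "v2", "b3": "v3", "b4": "v4", "b5": None, "b6": None}
y = {"b1": "v1", "b2": "w2", "b3": "w3", "b4": None, "b5": "w5", "b6": "w6"}

def rho_attr(u, v):
    """Per-attribute similarity: 1 if equal, 0 if different, 1/2 if either value is undefined."""
    if u is None or v is None:
        return 0.5
    return 1.0 if u == v else 0.0

def rho(x, y):
    """Similarity of two objects: average of the per-attribute similarities over A ∪ Am."""
    attrs = x.keys() | y.keys()
    return sum(rho_attr(x.get(b), y.get(b)) for b in attrs) / len(attrs)

print(rho(x, y))   # 0.41666..., i.e. [1+0+0+1/2+1/2+1/2]/6 = 5/12 as computed above

The confidence of the concatenated rule [r1 ∘ r] then averages ρ(x, SupSm(r1)) over all x in SupS(r) and multiplies the result by Conf(r1) and Conf(r), exactly as in the definition above.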
FUTURE TRENDS

Business users may be either unable or unwilling to proceed with actions leading to desired reclassifications of objects. Undertaking the actions may be trivial, feasible to an acceptable degree, or practically very difficult. Therefore, the notion of the cost of an action rule is of very great importance. New strategies for discovering action rules of the lowest cost in DAKS, based on ontologies, will be investigated.

CONCLUSION

Attributes are divided into two groups: stable and flexible. By stable attributes we mean attributes whose values cannot be changed (for instance, age or maiden name). On the other hand, attributes (like percentage rate or loan approval to buy a house) whose values can be changed are called flexible. Rules are extracted from a decision table using standard KD methods, with preference given to flexible attributes, so it is mainly flexible attributes that are listed in the classification parts of rules. Most of these rules
can be seen as actionable rules and can be used to construct action rules.

REFERENCES

Adomavicius, G., & Tuzhilin, A. (1997). Discovery of actionable patterns in databases: The action hierarchy approach. In Proceedings of the KDD'97 Conference, Newport Beach, CA. Menlo Park, CA: AAAI Press.

Geffner, H., & Wainer, J. (1998). Modeling action, knowledge and control. In H. Prade (Ed.), ECAI 98, 13th European Conference on AI (pp. 532-536). New York: John Wiley & Sons.

Liu, B., Hsu, W., & Chen, S. (1997). Using general impressions to analyze discovered classification rules. In Proceedings of the KDD'97 Conference, Newport Beach, CA. Menlo Park, CA: AAAI Press.

Pawlak, Z. (1991). Rough sets: Theoretical aspects of reasoning about data. Kluwer.

Ras, Z., & Dardzinska, A. (2002). Handling semantic inconsistencies in query answering based on distributed knowledge mining. In Foundations of Intelligent Systems, Proceedings of the ISMIS'02 Symposium (pp. 66-74). LNAI (No. 2366). Berlin: Springer-Verlag.

Ras, Z., & Gupta, S. (2002). Global action rules in distributed knowledge systems. Fundamenta Informaticae Journal, 51(1-2), 175-184.

Ras, Z., & Wieczorkowska, A. (2000). Action rules: How to increase profit of a company. In D.A. Zighed, J. Komorowski, & J. Zytkow (Eds.), Principles of Data Mining and Knowledge Discovery, Proceedings of PKDD'00 (pp. 587-592). LNAI (No. 1910), Lyon, France. Berlin: Springer-Verlag.

Silberschatz, A., & Tuzhilin, A. (1995). On subjective measures of interestingness in knowledge discovery. In Proceedings of the KDD'95 Conference. Menlo Park, CA: AAAI Press.

Silberschatz, A., & Tuzhilin, A. (1996). What makes patterns interesting in knowledge discovery systems. IEEE Transactions on Knowledge and Data Engineering, 5(6).

Tsay, L.-S., & Ras, Z.W. (2005). Action rules discovery system DEAR, method and experiments. Journal of Experimental and Theoretical Artificial Intelligence, Special Issue on Knowledge Discovery, 17(1-2), 119-128.

Tzacheva, A., & Ras, Z.W. (2003). Discovering non-standard semantics of semi-stable attributes. In I. Russell & S. Haller (Eds.), Proceedings of FLAIRS-2003 (pp. 330-334), St. Augustine, Florida. Menlo Park, CA: AAAI Press.

Tzacheva, A., & Ras, Z.W. (2005). Action rules mining. International Journal of Intelligent Systems, Special Issue on Knowledge Discovery (in press).

KEY TERMS

Actionable Rule: A rule is actionable if a user can do an action to his/her advantage based on this rule.

Autonomous Information System: An information system existing as an independent entity.

Domain of Rule: The attributes listed in the IF part of a rule.

Flexible Attribute: An attribute is called flexible if its value can be changed in time.

Knowledge Base: A collection of rules defined as expressions written in predicate calculus. These rules have the form of associations between conjuncts of values of attributes.

Ontology: An explicit formal specification of how to represent objects, concepts and other entities that are assumed to exist in some area of interest, and the relationships holding among them. Systems that share the same ontology are able to communicate about a domain of discourse without necessarily operating on a globally shared theory. A system commits to an ontology if its observable actions are consistent with the definitions in the ontology.

Semantics: The meaning of expressions written in some language, as opposed to their syntax, which describes how symbols may be combined independently of their meaning.

Stable Attribute: An attribute is called stable for the set U if its values assigned to objects from U cannot change in time.
Active Disks for Data Mining


Alexander Thomasian
New Jersey Institute of Technology, USA

INTRODUCTION computers sharing a set of disks). Parallel data mining is


appropriate for large-scale data mining (Zaki, 1999) and
Active disks allow the downloading of certain types of the active disk paradigm can be considered as a low-cost
processing from the host computer onto disks, more scheme for parallel data mining.
specifically the disk controller, which has access to a In active disks the host computer partially or fully
limited amount of local memory serving as the disk downloads an application, such as data mining, onto the
cache. Data read from a disk may be sent directly to the microprocessors serving as the disk controllers. These
host computer as usual or processed locally at the disk. microprocessors have less computing power than those
Only filtered information is uploaded to the host com- associated with servers, but because servers tend to have
puter. In this article, We am interested in downloading a large number of disks, the raw computing power asso-
data mining and database applications to the disk control- ciated with the disk controllers may easily exceed the
ler. (raw) computing power of the server. Computing power
This article is organized as follows. A section on is estimated as the sum of the MIPS (millions of in-
background information is followed by a discussion of structions per second) ratings of the microprocessors.
data-mining operations and hardware technology trends. The computing power associated with the disks comes
The more influential active disk projects are reviewed in the form of a shared-nothing system, with connectiv-
next. Future trends and conclusions appear last. ity limited to indirect communication via the host.
Amdahls law on the efficiency of parallel processing is
applicable both to multiprocessors and disk control-
BACKGROUND lers: If a fraction F of the processing can be carried out
in parallel with a degree of parallelism P, then the
Data mining has been defined as the use of algorithms to effective degree of parallelism is 1/(1F+F/P)
extract information and patterns as part of Knowledge (Hennessy & Patterson, 2003). There is a higher degree
Discovery in Databases (KDD) (Dunham, 2003). Cer- of interprocessor communication cost for disk control-
tain aspects of data mining introduced a decade ago were lers.
computationally challenging for the computer systems A concise specification of active disks is as fol-
at that time. This was partially due to the high cost of lows:
computing and the uncertainty associated with the value
of information extracted by data mining. Data mining A number of important I/O intensive applications can
has become more viable economically with the advent take advantage of computational power available
of cheap computing power based on UNIX, Linux, and directly at storage devices to improve their overall
Windows operating systems. performance, more effectively balance their
The past decade has shown a tremendous increase in consumption of system wide resources, and provide
the processing power of computer systems, which has functionality that would not be otherwise available
made possible the previously impossible. On the other (Riedel, 1999).
hand, with higher disk transfer rates, a single processor
is required to process the incoming data from one disk, Active disks should be helpful from the following
but the number of disks associated with a multiproces- viewpoint. The size of the database and the processing
sor server usually exceeds the number of processors, requirements for decision support systems (DSS) is
that is, the data transfer rate is not matched by the growing rapidly. This is attributable to the length of the
processing power of the host computer. history, the level of the detail being saved, and the
Computer systems have been classified into shared- increased number of users and queries (Keeton,
everything systems (multiple processors sharing the Patterson, & Hellerstein, 1998). The first two factors
main memory and several disks), shared-nothing sys- contribute to the capacity requirement, while all three
tems (multiple computers with connectivity via an inter- factors contribute to the processing requirement.
connection network), and shared-disk systems (several With the advent of relational databases (in the mid-
1970s) perceived actual inefficiencies associated with

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Active Disks for Data Mining

the processing of (SQLstructured query language) objects is determined by the distance of their feature
queries, as compared to low-level data manipulation lan- vectors. k-NN queries find the k objects with the smallest A
guages (DMLs) for hierarchical and network databases, (squared) Euclidean distances.
led to numerous proposals for database machines (Stanley Clustering methods group items, but unlike classifica-
& Su, 1983). The following categorization is given: (a) tion, the groups are not predefined. A distance measure,
intelligent secondary storage devices, proposed to speed such as the Euclidean distance between feature vectors of
up text retrieval but later modified to handle relational the objects, is used to form the clusters. The agglomerative
algebra operators, (b) database filters to accelerate table clustering algorithm is a hierarchical algorithm, which
scan, such as content-addressable file store (CAFS) from starts with as many clusters as there are data items.
International Computers Limited (ICL), (c) associative Agglomerative clustering tends to be expensive.
memory systems, which retrieve data by content, and (d) Non-hierarchical or partitional algorithms compute
database computers, which are mainly multicomputers. the clusters more efficiently. The popular k-means clus-
Intelligent secondary storage devices can be further tering method is in the family of squared-error cluster-
classified (Riedel, 1999): (a) processor per track ing algorithms, which can be implemented as follows:
(PPT), (b) processor per head (PPH), and (c) proces- (1) Designate k randomly selected points from n points
sor per disk (PPD). Given that modern disks have as the centroids of the clusters; (2) assign a point to the
thousands of tracks, the first solution is out of the cluster, whose centroid is closest to it, based on Euclid-
question. The second solution may require the R/W ean or some other distance measure; (3) recompute the
heads to be aligned simultaneously to access all the centroids for all clusters based on the items assigned to
tracks on a cylinder, which is not feasible. NCRs them; (4) repeat steps (2 through 3) with the new cen-
Teradata DBC/1012 database machine (1985) is a troids until there is no change in point membership. One
multicomputer PPD system. measure of the quality of clustering is the sum of
To summarize, according to the active disk para- squared distances (SSD) of points in each cluster with
digm, the host computer offloads the processing of respect to its centroid. The algorithm may be applied
data-warehousing and data-mining operators onto the several times and the results of the iteration with the
embedded microprocessor controller in the disk drive. smallest SSD selected. Clustering of large disk-resi-
There is usually a cache associated with each disk drive, dent datasets is a challenging problem (Dunham, 2003).
which is used to hold prefetched data but can be also Association rule mining (ARM) considers market-
used as a small memory, as mentioned previously. basket or shopping-cart data, that is, the items purchased
on a particular visit to the supermarket. ARM first
determines the frequent sets, which have to meet a
MAIN THRUST certain support level. For example, s% support for two
items A and B, such as bread and butter, implies that they
Data mining, which requires high data access band- appear together in s percent of transactions. Another
widths and is computationally intensive, is used to illus- measure is the confidence level, which is the ratio of the
trate active disk applications. support for the set intersection of A and B divided by the
support for A by itself. If bread and butter appear to-
Data-Mining Applications gether in most market-basket transactions, then there is
high confidence that customers who buy bread also buy
The three main areas of data mining are (a) classifica- butter. On the other hand, this is meaningful only if a
tion, (b) clustering, and (c) association rule mining significant fraction of customers bought bread, that is,
(Dunham, 2003). A brief review is given of the methods the support level is high. Multiple passes over the data
discussed in this article. are required to find all association rules (with a lower
Classification assigns items to appropriate classes bound for the support) when the number of objects is
by using the attributes of each item. When regression is large.
used for this purpose, the input values are the item Algorithms to reduce the cost of ARM include sam-
attributes, and the output is its class. The k-nearest- pling, partitioning (the argument for why this works is
neighbor (k-NN) method uses a training set, and a new that a frequent set of items must be frequent in at least
item is placed in the set, whose entries appear most one partition), and parallel processing (Zaki, 1999).
among the k-NNs of the target item.
K-NN queries are also used in similarity search, for Hardware Technology Trends
example, content based image retrieval (CBIR). Ob-
jects (images) are represented by feature vectors in the A computer system, which may be a server, a worksta-
areas of object color, texture, and so forth. Similarity of tion, or a PC, has three components most affecting its

TEAM LinG
Active Disks for Data Mining

performance: one or more microprocessors, a main memory, Table scan is a costly operation in relational data-
and magnetic disk storage. My discussion of technology bases, which is applied selectively when the search
trends is based on Patterson and Keeton (1998). argument is not sargable, that is, not indexed. The data
The main memory consists of dynamic random access filtering for table scans can be carried out in parallel if the
memory (DRAM) chips. Memory chip capacity is qua- data is partitioned across several disks. A GROUP BY
drupled every three years so that the memory capacity SQL statement computes a certain value, such as the
versus cost ratio is increasing 25% per year. The memory minimum, the maximum, or the mean, based on some
access latency is about 150 nanoseconds and, in the case classification of the rows in the table under consider-
of RAMBUS DRAM, the bandwidth is 800 to 1600 MB per ation, for example, undergraduate standing. To compute
second. The access latency is dropping 7% per year, and the overall average GPA, each disk sends its mean GPA
the bandwidth is increasing 20% per year. According to and the number of participating records to the host
Moores law, processor speed increases 60% per year computer.
versus 7% for main memories, so that the processor- A table scan outperforms indexing in processing k-
memory performance gap grows 50% per year. Multilevel NN queries for high dimensions. There is the additional
cache memories are used to bridge this gap and reduce the cost of building and maintaining the index. A synthetic
number of processor cycles per instruction (Hennessy dataset associated with the IBM Almadens Quest
& Patterson, 2003). project for loan applications is used to experiment with
According to Greg Papadopulos (at Sun k-NN queries. The relational table contains the follow-
Microsystems), the demand for database applications ing attributes: age, education, salary, commission, zip
exceeds the increase in central processing unit (CPU) code, make of car, cost of house, loan amount, years
speed according to Moores law (Patterson & Keeton, owned. In the case of categorical attributes, an exact
1998). Both are a factor of two, but the former is in 9 to match is required.
12 months, and the latter is in 18 months. Consequently, The Apriori algorithm for ARM is also considered
the so-called database gap is increasing with time. in determining whether customers who purchase a par-
Disk capacity is increasing at the rate of 60% per ticular set of items also purchase an additional item,
year. This is due to dramatic increases in magnetic re- but this is meaningful at a certain level of support.
cording density. Disk access time consists of queueing More rules are generated for lower values of support,
time, controller time, seek time, rotational latency, and so the limited size of the disk cache may become a
transfer time. The increased disk RPMs (rotations per bottleneck.
minute) and especially increased linear recording densi- Analytical models and measurement results from a
ties have resulted in very high transfer rates, which are prototype show that active disks scale beyond the point
increasing 60% annually. The sum of seek time and where the server saturates.
rotational latency, referred to as positioning time, is of
the order of 10 milliseconds and is decreasing very Freeblock Scheduling
slowly (8% per year). Utilizing disk access bandwidth is
a very important consideration and is the motivation This CMU project emphasizes disk performance from
behind freeblock scheduling (Lumb, Schindler, Ganger, the viewpoint of maximizing disk arm utilization (Lumb
Nagle, & Riedel, 2000; Riedel, Faloutsos, Ganger, & et al., 2000). Freeblock scheduling utilizes opportu-
Nagle, 2000). nistic reading of low-priority blocks of data from disk,
while the arm having completed the processing of a
Active Disk Projects high priority request is moving to process another such
request. Opportunities for freeblock scheduling di-
There have been several concurrent activities in the area minish as more and more blocks are being read. This is
of active disks at Carnegie Mellon University, the Uni- because blocks located centrally will be accessed right
versity of California at Santa Barbara, University of away, although other disk blocks located at extreme
Maryland, University of California at Berkeley, and in disk cylinders may require an explicit access.
the SmartSTOR project at IBMs Almaden Research In one scenario a disk processes requests by an OLTP
Center. (online transaction processing) application and a back-
ground ARM application. OLTP requests have a higher
Active1 Disk Projects at CMU priority because transaction response time should be as
low as possible, but ARM requests are processed as
Most of this effort is summarized in (Riedel, Gibson, & freeblock requests. OLTP requests access specific
Faloutsos, 1998). records, but ARM requires multiple passes over the

TEAM LinG
Active Disks for Data Mining

dataset in any order. This is a common feature of algo- The host, which runs application programs, can offload
rithms suitable for freeblock scheduling. processing to SmartSTORs, which then deliver results A
In the experimental study, OLTP requests are gener- back to the host. Experimental results with the TPC-D
ated by transactions running at a given multiprogramming benchmark are presented (see http://www.tpc.org).
level (MPL) with a certain think time, that is, the time
before the transaction generates its next request. Re-
quests are to 8 KB blocks, and the Read:Write ratio is 2:1, FUTURE TRENDS
as in the TPC-C benchmark (see http://www.tpc.org),
while the background process accesses 4 KB blocks. The The offloading of activities from the host to peripheral
bandwidth due to freeblock requests increases with in- devices has been carried out successfully in the past.
creasing MPL. Initiating low-priority requests when the SmartSTOR is an intermediate step, and active disk is a
disk has a low utilization is not considered in this study, more distant possibility, which would require standard-
because such accesses would result in an increase in the ization activities such as object-based storage devices
response time of disk accesses on behalf of the OLTP (OSD) (http://www.snia.org).
application. Additional seeks are required to access the
remaining blocks so that the last 5% of requests takes 30%
of the time of a full scan. CONCLUSION
Active Disks at UCSB/Maryland The largest benefit stemming from active disks comes
from the parallel processing capability provided by a
The host computer acts as a coordinator, scheduler, and large number of disks, that the aggregate processing
combiner of results, while the bulk of processing is power of disk controllers may exceed the computing
carried out at the disks (Acharya, Uysal, & Saltz, 1998). power of servers.
The computation at the host initiates disklets at the The filtering effect is another benefit of active disks.
disks. Disklets are disallowed to initiate disk accesses Disk transfer rates are increasing rapidly, so that by
and to allocate and free memory in the disk cache. All eliminating unnecessary I/O transfers, more disks can
these functions are carried out by the host computers be placed on I/O buses or storage area networks
operating system (OS). Disklets can only access memory (SANs).
locations (in a disks cache) within certain bounds speci-
fied by the host computers OS.
Implemented are SELECT, GROUP BY, and ACKNOWLEDGMENT
DATACUBE operator, which computes GROUP BYs
for all possible combinations of a list of attributes Supported by NSF through Grant 0105485 in Computer
(Dunham, 2003), external sort, image convolution, and Systems Architecture.
generating composite satellite images.

The SmartSTOR Project2 REFERENCES


SmartSTOR is significantly different from the previous Acharya, A., Uysal, M., & Saltz, J. H. (1998). Active
projects. It is argued in Hsu, Smith, and Young (2000) disks: Programming model, algorithms, and evaluation.
that enhancing the computing power of individual disks Proceedings of the Eighth International Conference
is not economically viable due to stringent constraints on Architectural Support for Programming Languages
on the power budget and the cost of individual disks. The and Operating Systems (pp. 81-91), USA.
low market share for such high-cost, high-performance,
and high-functionality disks would further increase their Agarwal, S., Agrawal, R., Deshpande, P., Gupta, A.,
cost. Naughton, J., Ramakrishnan, R. et al. (1996). On the
The following alternatives are considered (Hsu et computation of multidimensional aggregates. Proceed-
al., 2000): (a) no offloading, (b) offloading operations ings of the 22nd International Conference on Very
on single tables, and (c) offloading operations on mul- Large Data Bases (pp. 506-521), India.
tiple tables. The third approach, which makes the most
sense, is implemented in SmartSTOR, which is the Dunham, M. H. (2003). Data mining: Introductory and
controller of multiple disks and can coordinate the advanced topics. Prentice-Hall.
joining of relational tables residing on the disks.

TEAM LinG
Active Disks for Data Mining

Hennessey, J. L., & Patterson, D. A. (2003). Computer Database Computer or Machine: A specialized com-
architecture: A quantitative approach (3rd ed.). Morgan puter for database applications, which usually works in
Kaufmann. conjunction with a host computer.
Hsu, W. W., Smith, A. J., & Young, H. (2000). Projecting Database Gap: Processing demand for database appli-
the performance of decision support workloads with smart cations is twofold in nine to 12 months, but it takes 18
storage (SmartSTOR). Proceedings of the Seventh Inter- months for the processor speed to increase that much
national Conference on Parallel and Distributed Sys- according to Moores law.
tems (pp. 417-425), Japan.
Disk Access Time: Sum of seek time (ST), rotational
Keeton, K., Patterson, D. A., & Hellerstein, J. M. (1998). latency (RL), and transfer time (TT). ST is the time to move
A case for intelligent disks (IDISKs). ACM SIGMOD the read/write heads (attached to the disk arm) to the
Record, 27(3), 42-52. appropriate concentric track on the disk. There is also a
head selection time to select the head on the appropriate
Lumb, C., Schindler, J., Ganger, G. R., Nagle, D. F., & track on the disk. RL for small block transfers is half of the
Riedel, E. (2000). Towards higher disk head utilization: disk rotation time. TT is the ratio of the block size and the
Extracting free bandwidth from busy disk drives. Pro- average disk transfer rate.
ceedings of the Fourth Symposium on Operating Sys-
tems Design and Implementation (pp. 87-102), USA. Freeblock Scheduling: A disk arm scheduling method
that uses opportunistic accesses to disk blocks required
Patterson, D. A., & Keeton, K. (1998). Hardware technol- for a low-priority activity.
ogy trends and database opportunities. Keynote address
of the ACM SIGMOD Conference on Management of Processor per Track/Head/Disk: The last organiza-
Data. Retrieved from http://www.cs.berkeley.edu/~pattr tion corresponds to active disks.
sn/talks.html
Shared Everything/Nothing/Disks System: The main
Ramakrishnan, R., & Gehrke, J. (2003). Database manage- memory and disks are shared by the (multiple) processors
ment systems (3rd ed.). McGraw-Hill. in the first case, nothing is shared in the second case (i.e.,
standalone computers connected via an interconnection
Riedel, E. (1999). Active disk Remote execution for network), and disks are shared by processor memory
network attached storage (Tech. Rep. No. CMU-CS- combinations in the third case.
99-177). CMU, Department of Computer Science.
SmartSTOR: A scheme where the disk array control-
Riedel, E., Faloutsos, C., Ganger, G. R., & Nagle, D. F. ler for multiple disks assists the host in processing data-
(2000). Data mining in an OLTP system (nearly) for base applications.
free. Proceedings of the ACM SIGMOD International
Conference on Management of Data (pp. 13-21), USA. Table Scan: The sequential reading of all the blocks
of a relational table to select a subset of its attributes
Riedel, E., Gibson, G. A., & Faloutsos, C. (1998). Active based on a selection argument, which is either not
storage for large scale data mining and multimedia appli- indexed (called a sargable argument) or the index is not
cations. Proceedings of the 24th International Very clustered.
Large Data Base Conference (pp. 62-73), USA.
Transaction Processing Council: This council
Su, W.-Y.S. (1983). Advanced database machine archi- has published numerous benchmarks for transaction
tecture. Prentice-Hall. processing (TPC-C), decision support (TPC-H and TPC-
Zaki, M. J. (1999). Parallel and distributed associative rule R), transactional Web benchmark, also supporting brows-
mining: A survey. IEEE Concurrency, 7(4), 14-25. ing (TPC-W).

KEY TERMS ENDNOTES

Active Disk: A disk whose controller runs applica-


1
There have been several concurrent activities in
tion code, which can process data on disk. the area of Active Disks: (a) Riedel, Gibson, and
Faloutsos at Carnegie-Mellon University (1998)
Content-Addressable File Store (CAFS): Specialized and Riedel (1999); (b) Acharya, Uysal, and Saltz
hardware from ICL (UKs International Computers Limited) (1998) at the University of California at Santa
used as a filter for database applications.

10

TEAM LinG
Active Disks for Data Mining

Barbara (UCSB) and University of Maryland; (c) PipeHash algorithm represents the datacube as a
intelligent disks by Keeton, Patterson, and lattice of related GROUP BYs. A directed edge A
Hellerstein (1998) at the University of California connects a GROUP BY i to a GROUP BY j, if j can
(UC) at Berkeley; and (d) the SmartSTOR project at be generated by i and has one less attribute
UC Berkeley and the IBM Almaden Research Center (Agarwal, Agrawal, Deshpande, Gupta, Naughton,
by Hsu, Smith, and Young (2000). Ramakrishnan, et al., 1996). Three other applica-
2
The SQL SELECT and GROUP BY statements are tions dealing with an external sort, image convolu-
easy to implement. The datacube operator com- tion, and generating composite satellite images
putes GROUP BYs for all possible combinations are beyond the scope of this discussion.
of a list of attributes (Dunham, 2003). The

11

TEAM LinG
12

Active Learning with Multiple Views


Ion Muslea
SRI International, USA

INTRODUCTION they bootstrap the views from each other by augmenting


the training set with unlabeled examples on which the
Inductive learning algorithms typically use a set of other views make high-confidence predictions. Such
labeled examples to learn class descriptions for a set of algorithms improve the classifiers learned from labeled
user-specified concepts of interest. In practice, label- data by also exploiting the implicit information pro-
ing the training examples is a tedious, time consuming, vided by the distribution of the unlabeled examples.
error-prone process. Furthermore, in some applica- In contrast to semi-supervised learning, active learn-
tions, the labeling of each example also may be ex- ers (Tong & Koller, 2001) typically detect and ask the
tremely expensive (e.g., it may require running costly user to label only the most informative examples in the
laboratory tests). In order to reduce the number of domain, thus reducing the users data-labeling burden.
labeled examples that are required for learning the Note that active and semi-supervised learners take dif-
concepts of interest, researchers proposed a variety of ferent approaches to reducing the need for labeled data;
methods, such as active learning, semi-supervised learn- the former explicitly search for a minimal set of labeled
ing, and meta-learning. examples from which to perfectly learn the target con-
This article presents recent advances in reducing the cept, while the latter aim to improve a classifier learned
need for labeled data in multi-view learning tasks; that from a (small) set of labeled examples by exploiting
is, in domains in which there are several disjoint subsets some additional unlabeled data.
of features (views), each of which is sufficient to learn In keeping with the active learning approach, this
the target concepts. For instance, as described in Blum article focuses on minimizing the amount of labeled
and Mitchell (1998), one can classify segments of data without sacrificing the accuracy of the learned
televised broadcast based either on the video or on the classifiers. We begin by analyzing co-testing (Muslea,
audio information; or one can classify Web pages based 2002), which is a novel approach to active learning. Co-
on the words that appear either in the pages or in the testing is a multi-view active learner that maximizes the
hyperlinks pointing to them. In summary, this article benefits of labeled training data by providing a prin-
focuses on using multiple views for active learning and cipled way to detect the most informative examples in a
improving multi-view active learners by using semi- domain, thus allowing the user to label only these.
supervised- and meta-learning. Then, we discuss two extensions of co-testing that
cope with its main limitationsthe inability to exploit
the unlabeled examples that were not queried and the
BACKGROUND lack of a criterion for deciding whether a task is appro-
priate for multi-view learning. To address the former,
we present Co-EMT (Muslea et al., 2002a), which inter-
Active, Semi-Supervised, and leaves co-testing with a semi-supervised, multi-view
Multi-view Learning learner. This hybrid algorithm combines the benefits of
active and semi-supervised learning by detecting the
Most of the research on multi-view learning focuses on most informative examples, while also exploiting the
semi-supervised learning techniques (Collins & Singer, remaining unlabeled examples. Second, we discuss Adap-
1999, Pierce & Cardie, 2001) (i.e., learning concepts tive View Validation (Muslea et al., 2002b), which is a
from a few labeled and many unlabeled examples). By meta-learner that uses the experience acquired while
themselves, the unlabeled examples do not provide any solving past learning tasks to predict whether multi-
direct information about the concepts to be learned. How- view learning is appropriate for a new, unseen task.
ever, as shown by Nigam, et al. (2000) and Raskutti, et al.
(2002), their distribution can be used to boost the accuracy A Motivating Problem: Wrapper
of a classifier learned from the few labeled examples.
Intuitively, semi-supervised, multi-view algorithms
Induction
proceed as follows: first, they use the small labeled
training set to learn one classifier in each view; then, Information agents such as Ariadne (Knoblock et al.,
2001) integrate data from pre-specified sets of Web
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Active Learning with Multiple Views

sites so that they can be accessed and combined via Figure 1. An information agent that combines data
database-like queries. For example, consider the agent in from the Zagats restaurant guide, the L.A. County A
Figure 1, which answers queries such as the following: Health Department, the ETAK Geocoder, and the Tiger
Map service
Show me the locations of all Thai restaurants in L.A. Restaurant Guide
that are A-rated by the L.A. County Health Department.
L.A. County Query:
Health Dept. A-rated Thai
To answer this query, the agent must combine data
restaurants
from several Web sources: in L.A.

from Zagats, it obtains the name and address of all


Thai restaurants in L.A.;
Agent
from the L.A. County Web site, it gets the health
rating of any restaurant of interest; RESULTS:

from the Geocoder, it obtains the latitude/longi-


tude of any physical address; Geocoder
from Tiger Map, it obtains the plot of any location,
given its latitude and longitude.
Tiger Map Server
Information agents typically rely on wrappers to
extract the useful information from the relevant Web
pages. Each wrapper consists of a set of extraction rules MAIN THRUST
and the code required to apply them. As manually writing
the extraction rules is a time-consuming task that re-
In the context of wrapper induction, we intuitively de-
quires a high level of expertise, researchers designed
scribe three novel algorithms: Co-Testing, Co-EMT,
wrapper induction algorithms that learn the rules from
and Adaptive View Validation. Note that these algo-
user-provided examples (Muslea et al., 2001).
rithms are not specific to wrapper induction, and they
In practice, information agents use hundreds of ex-
have been applied to a variety of domains, such as text
traction rules that have to be updated whenever the
classification, advertisement removal, and discourse
format of the Web sites changes. As manually labeling
tree parsing (Muslea, 2002).
examples for each rule is a tedious, error-prone task,
one must learn high accuracy rules from just a few
labeled examples. Note that both the small training sets Co-Testing: Multi-View Active Learning
and the high accuracy rules are crucial to the successful
deployment of an agent. The former minimizes the Co-Testing (Muslea, 2002, Muslea et al., 2000), which
amount of work required to create the agent, thus mak- is the first multi-view approach to active learning, works
ing the task manageable. The latter is required in order as follows:
to ensure the quality of the agents answer to each query:
when the data from multiple sources is integrated, the first, it uses a small set of labeled examples to
errors of the corresponding extraction rules get com- learn one classifier in each view;
pounded, thus affecting the quality of the final result; then, it applies the learned classifiers to all unla-
for instance, if only 90% of the Thai restaurants and beled examples and asks the user to label one of
90% of their health ratings are extracted correctly, the the examples on which the views predict different
result contains only 81% (90% x 90% = 81%) of the A- labels;
rated Thai restaurants. it adds the newly labeled example to the training
We use wrapper induction as the motivating problem set and repeats the whole process.
for this article because, despite the practical impor-
tance of learning accurate wrappers from just a few Intuitively, Co-Testing relies on the following ob-
labeled examples, there has been little work on active servation: if the classifiers learned in each view predict
learning for this task. Furthermore, as explained in a different label for an unlabeled example, at least one
Muslea (2002), existing general-purpose active learn- of them makes a mistake on that prediction. By asking
ers cannot be applied in a straightforward manner to the user to label such an example, Co-Testing is guaran-
wrapper induction. teed to provide useful information for the view that
made the mistake.

13

TEAM LinG
Active Learning with Multiple Views

To illustrate Co-Testing for wrapper induction, con- iterative, two-step process: first, it uses the hypotheses
sider the task of extracting restaurant phone numbers from learned in each view to probabilistically label all the
documents similar to the one shown in Figure 2. To extract unlabeled examples; then it learns a new hypothesis in
this information, the wrapper must detect both the begin- each view by training on the probabilistically labeled
ning and the end of the phone number. For instance, to find examples provided by the other view.
where the phone number begins, one can use the following By interleaving active and semi-supervised learn-
rule: ing, Co-EMT creates a powerful synergy. On one hand,
Co-Testing boosts Co-EMs performance by providing
R1 = SkipTo( Phone:<i> ) it with highly informative labeled examples (instead of
random ones). On the other hand, Co-EM provides Co-
This rule is applied forward, from the beginning of Testing with more accurate classifiers (learned from
the page, and it ignores everything until it finds the string both labeled and unlabeled data), thus allowing Co-
Phone:<i>. Note that this is not the only way to detect Testing to make more informative queries.
where the phone number begins. An alternative way to Co-EMT was not yet applied to wrapper induction,
perform this task is to use the following rule: because the existing algorithms are not probabilistic
learners; however, an algorithm similar to Co-EMT was
R2 = BackTo( Cuisine ) BackTo( ( Number ) ) applied to information extraction from free text (Jones
et al., 2003). To illustrate how Co-EMT works, we
which is applied backward, from the end of the document. describe now the generic algorithm Co-EMTWI, which
R2 ignores everything until it finds Cuisine and then, combines Co-Testing with the semi-supervised wrap-
again, skips to the first number between parentheses. per induction algorithm described next.
Note that R1 and R2 represent descriptions of the In order to perform semi-supervised wrapper in-
same concept (i.e., beginning of phone number) that are duction, one can exploit a third view, which is used to
learned in two different views (see Muslea et al. [2001] evaluate the confidence of each extraction. This new
for details on learning forward and backward rules). That content-based view (Muslea et al., 2003) describes the
is, views V1 and V2 consist of the sequences of charac- actual item to be extracted. For example, in the phone
ters that precede and follow the beginning of the item, numbers extraction task, one can use the labeled ex-
respectively. View V1 is called the forward view, while amples to learn a simple grammar that describes the
V2 is the backward view. Based on V1 and V2, Co-Testing field content: (Number) Number Number. Similarly,
can be applied in a straightforward manner to wrapper when extracting URLs, one can learn that a typical URL
induction. As shown in Muslea (2002), Co-Testing clearly starts with the string http://www., ends with the string
outperforms existing state-of-the-art algorithms, both .html, and contains no HTML tags.
on wrapper induction and a variety of other real world Based on the forward, backward, and content-based
domains. views, one can implement the following semi-super-
vised wrapper induction algorithm. First, the small set
Co-EMT: Interleaving Active and of labeled examples is used to learn a hypothesis in
Semi-Supervised Learning each view. Then, the forward and backward views feed
each other with unlabeled examples on which they
To further reduce the need for labeled data, Co-EMT make high-confidence extractions (i.e., strings that are
(Muslea et al., 2002a) combines active and semi-super- extracted by either the forward or the backward rule and
vised learning by interleaving Co-Testing with Co-EM are also compliant with the grammar learned in the
(Nigam & Ghani, 2000). Co-EM, which is a semi-super- third, content-based view).
vised, multi-view learner, can be seen as the following Given the previous Co-Testing and the semi-super-
vised learner, Co-EMTWI combines them as follows. First,
the sets of labeled and unlabeled examples are used for
semi-supervised learning. Second, the extraction rules
Figure 2. The forward rule R1 and the backward rule that are learned in the previous step are used for Co-
R2 detect the beginning of the phone number. Forward Testing. After making a query, the newly labeled example
and backward rules have the same semantics and differ is added to the training set, and the whole process is
only in terms of from where they are applied (start/end repeated for a number of iterations. The empirical study in
of the document) and in which direction Muslea, et al., (2002a) shows that, for a large variety of
R1: SkipTo( Phone : <i> ) R2: BackTo(Cuisine) BackTo( (Number) ) text classification tasks, Co-EMT outperforms both Co-
Testing and the three state-of-the-art semi-supervised
Name: <i>Ginos </i> <p>Phone :<i> (800)111-1717 </i> <p> Cuisine :
learners considered in that comparison.

14

TEAM LinG
Active Learning with Multiple Views

View Validation: Are the Views FUTURE TRENDS


Adequate for Multi-View Learning? A
There are several major areas of future work in the field
The problem of view validation is defined as follows: of multi-view learning. First, there is a need for a view
given a new unseen multi-view learning task, how does a detection algorithm that automatically partitions a
user choose between solving it with a multi- or a single- domains features in views that are adequate for multi-
view algorithm? In other words, how does one know view learning. Such an algorithm would remove the last
whether multi-view learning will outperform pooling all stumbling block against the wide applicability of multi-
features together and applying a single-view learner? view learning (i.e., the requirement that the user pro-
Note that this question must be answered while having vides the views to be used). Second, in order to reduce
access to just a few labeled and many unlabeled ex- the computational costs of active learning (re-training
amples: applying both the single- and multi-view active after each query is CPU-intensive), one must consider
learners and comparing their relative performances is a look-ahead strategies that detect and propose (near)
self-defeating strategy, because it doubles the amount optimal sets of queries. Finally, Adaptive View Valida-
of required labeled data (one must label the queries tion has the limitation that it must be trained separately
made by both algorithms). for each application domain (e.g., once for wrapper
The need for view validation is motivated by the induction, once for text classification, etc.). A major
following observation: while applying Co-Testing to improvement would be a domain-independent view
dozens of extraction tasks, Muslea et al. (2002b) no- validation algorithm that, once trained on a mixture of
ticed that the forward and backward views are appropri- tasks from various domains, can be applied to any new
ate for most, but not all, of these learning tasks. This learning task, independently of its application domain.
view adequacy issue is related tightly to the best extrac-
tion accuracy reachable in each view. Consider, for
example, an extraction task in which the forward and CONCLUSION
backward rules lead to a high- and low-accuracy rule,
respectively. Note that Co-Testing is not appropriate In this article, we focus on three recent developments
for solving such tasks; by definition, multi-view learn- that, in the context of multi-view learning, reduce the
ing applies only to tasks in which each view is sufficient need for labeled training data.
for learning the target concept (obviously, the low-
accuracy view is insufficient for accurate extraction). Co-Testing: A general-purpose, multi-view ac-
To cope with this problem, one can use Adaptive tive learner that outperforms existing approaches
View Validation (Muslea et al., 2002b), which is a meta- on a variety of real-world domains.
learner that uses the experience acquired while solving Co-EMT: A multi-view learner that obtains a ro-
past learning tasks to predict whether the views of a new bust behavior over a wide spectrum of learning
unseen task are adequate for multi-view learning. The tasks by interleaving active and semi-supervised
view validation algorithm takes as input several solved multi-view learning.
extraction tasks that are labeled by the user as having Adaptive View Validation: A meta-learner that
views that are adequate or inadequate for multi-view uses past experiences to predict whether multi-
learning. Then, it uses these solved extraction tasks to view learning is appropriate for a new unseen
learn a classifier that, for new unseen tasks, predicts learning task.
whether the views are adequate for multi-view learning.
The (meta-) features used for view validation are
properties of the hypotheses that, for each solved task, REFERENCES
are learned in each view (i.e., the percentage of unla-
beled examples on which the rules extract the same Blum, A., & Mitchell, T. (1998). Combining labeled and
string, the difference in the complexity of the forward unlabeled data with co-training. Proceedings of the
and backward rules, the difference in the errors made on Conference on Computational Learning Theory
the training set, etc.). For both wrapper induction and (COLT-1998).
text classification, Adaptive View Validation makes
accurate predictions based on a modest amount of train- Collins, M., & Singer, Y. (1999). Unsupervised models
ing data (Muslea et al., 2002b). for named entity classification. Empirical Methods in

15

TEAM LinG
Active Learning with Multiple Views

Natural Language Processing & Very Large Corpora Nigam, K., McCallum, A., Thrun, S., & Mitchell, T. (2000).
(pp. 100-110). Text classification from labeled and unlabeled docu-
ments using EM. Machine Learning, 39(2-3), 103-134.
Jones, R., Ghani, R., Mitchell, T., & Riloff, E. (2003).
Active learning for information extraction with mul- Pierce, D., & Cardie, C. (2001). Limitations of co-training
tiple view feature sets. Proceedings of the ECML-2003 for natural language learning from large datasets. Empiri-
Workshop on Adaptive Text Extraction and Mining. cal Methods in Natural Language Processing, 1-10.
Knoblock, C. et al. (2001). The Ariadne approach to Web- Raskutti, B., Ferra, H., & Kowalczyk, A. (2002). Using
based information integration. International Journal of unlabeled data for text classification through addition
Cooperative Information Sources, 10, 145-169. of cluster parameters. Proceedings of the Interna-
tional Conference on Machine Learning (ICML-2002).
Muslea, I. (2002). Active learning with multiple views
[doctoral thesis]. Los Angeles: Department of Com- Tong, S., & Koller, D. (2001). Support vector machine
puter Science, University of Southern California. active learning with applications to text classification.
Journal of Machine Learning Research, 2, 45-66.
Muslea, I., Minton, S., & Knoblock, C. (2000). Selective
sampling with redundant views. Proceedings of the Na-
tional Conference on Artificial Intelligence (AAAI-2000).
Muslea, I., Minton, S., & Knoblock, C. (2001). Hierar-
KEY TERMS
chical wrapper induction for semi-structured sources.
Journal of Autonomous Agents & Multi-Agent Sys- Active Learning: Detecting and asking the user to
tems, 4, 93-114. label only the most informative examples in the domain
(rather than randomly-chosen examples).
Muslea, I., Minton, S., & Knoblock, C. (2002a). Active
+ semi-supervised learning = robust multi-view learn- Inductive Learning: Acquiring concept descrip-
ing. Proceedings of the International Conference on tions from labeled examples.
Machine Learning (ICML-2002). Meta-Learning: Learning to predict the most ap-
Muslea, I., Minton, S., & Knoblock, C. (2002b). Adap- propriate algorithm for a particular task.
tive view validation: A first step towards automatic view Multi-View Learning: Explicitly exploiting sev-
detection. Proceedings of the International Confer- eral disjoint sets of features, each of which is sufficient
ence on Machine Learning (ICML-2002). to learn the target concept.
Muslea, I., Minton, S., & Knoblock, C. (2003). Active Semi-Supervised Learning: Learning from both
learning with strong and weak views: A case study on labeled and unlabeled data.
wrapper induction. Proceedings of the International
Joint Conference on Artificial Intelligence (IJCAI-2003). View Validation: Deciding whether a set of views
is appropriate for multi-view learning.
Nigam, K., & Ghani, R. (2000). Analyzing the effective-
ness and applicability of co-training. Proceedings of Wrapper Induction: Learning (highly accurate)
the Conference on Information and Knowledge Man- rules that extract data from a collection of documents
agement (CIKM-2000). that share a similar underlying structure.

16

TEAM LinG
17

Administering and Managing a Data A


Warehouse
James E. Yao
Montclair State University, USA

Chang Liu
Northern Illinois University, USA

Qiyang Chen
Montclair State University, USA

June Lu
University of Houston-Victoria, USA

INTRODUCTION tion, 2004). The fast growth of databases enables compa-


nies to capture and store a great deal of business opera-
As internal and external demands on information from tion data and other business-related data. The data that are
managers are increasing rapidly, especially the infor- stored in the databases, either historical or operational,
mation that is processed to serve managers specific have been considered corporate resources and an asset
needs, regular databases and decision support systems that must be managed and used effectively to serve the
(DSS) cannot provide the information needed. Data ware- corporate business for competitive advantages.
houses came into existence to meet these needs, consoli- A database is a computer structure that houses a self-
dating and integrating information from many internal and describing collection of related data (Kroenke, 2004;
external sources and arranging it in a meaningful format Rob & Coronel, 2004). This type of data is primitive,
for making accurate business decisions (Martin, 1997). In detailed, and used for day-to-day operation. The data in
the past five years, there has been a significant growth in a warehouse is derived, meaning it is integrated, sub-
data warehousing (Hoffer, Prescott, & McFadden, 2005). ject-oriented, time-variant, and nonvolatile (Inmon,
Correspondingly, this occurrence has brought up the 2002). A data warehouse is defined as an integrated
issue of data warehouse administration and management. decision support database whose content is derived
Data warehousing has been increasingly recognized as an from various operational databases (Hoffer, Prescott, &
effective tool for organizations to transform data into McFadden, 2005; Sen & Jacob, 1998). Often a data ware-
useful information for strategic decision-making. To house can be referred to as a multidimensional database
achieve competitive advantages via data warehousing, because each occurrence of the subject is referenced by
data warehouse management is crucial (Ma, Chou, & Yen, an occurrence of each of several dimensions or character-
2000). istics of the subject (Gillenson, 2005). Some multidimen-
sional databases operate on a technological foundation
optimal for slicing and dicing the data, where data can
BACKGROUND be thought of as existing in multidimensional cubes (Inmon,
2002). Regular databases load data in two-dimensional
Since the advent of computer storage technology and tables. A data warehouse can use OLAP (online analytical
higher level programming languages (Inmon, 2002), processing) to provide users with multidimensional views
organizations, especially larger organizations, have put of their data, which can be visually represented as a cube
enormous amount of investment in their information for three dimensions (Senn, 2004).
system infrastructures. In a 2003 IT spending survey, With the host of differences between a database for
45% of American company participants indicated that day-to-day operation and a data warehouse for support-
their 2003 IT purchasing budgets had increased com- ing management decision-making process, the adminis-
pared with their budgets in 2002. Among the respon- tration and management of a data warehouse is of course
dents, database applications ranked top in areas of tech- far from similar. For instance, a data warehouse team
nology being implemented or had been implemented, requires someone who does routine data extraction,
with 42% indicating a recent implementation (Informa- transformation, and loading (ETL) from operational data-

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Administering and Managing a Data Warehouse

bases into data warehouse databases. Thus the team Formulating Strategic Plans: Environmental fac-
requires a technical role called ETL Specialist. On the tors can be matched up against the strategic plan by
other hand, a data warehouse is intended to support the identifying current market positioning, financial
business decision-making process. Someone like a busi- goals, and opportunities.
ness analyst is also needed to ensure that business Determining Specific Objectives: Exploration ware-
information requirements are crossed to the data ware- house can be used to find patterns; if found, these
house development. Data in the data warehouse can be patterns are then compared with patterns discov-
very sensitive and cross functional areas, such as per- ered previously to optimize corporate objectives
sonal medical records and salary information. There- (Inmon, Terdeman, & Imhoff, 2000).
fore, a higher level of security on the data is needed.
Encrypting the sensitive data in data warehouse is a While managing a data warehouse for business strat-
potential solution. Issues as such in data warehouse egy, what needs to be taken into consideration is the
administration and management need to be defined and difference between companies. No one formula fits
discussed. every organization. Avoid using so called templates
from other companies. The data warehouse is used for
your companys competitive advantages. You need to
MAIN THRUST follow your companys user information requirements
for strategic advantages.
Data warehouse administration and management covers
a wide range of fields. This article focuses only on data Data Warehouse Development Cycle
warehouse and business strategy, data warehouse devel-
opment life cycle, data warehouse team, process man- Data warehouse system development phases are similar
agement, and security management to present the cur- to the phases in the systems development life cycle
rent concerns and issues in data warehouse administra- (SDLC) (Adelman & Rehm, 2003). However, Barker
tion and management. (1998) thinks that there are some differences between
the two due to the unique functional and operational
Data Warehouse and Business Strategy features of a data warehouse. As business and informa-
tion requirements change, new corporate information
Data is the blood of an organization. Without data, the models evolve and are synthesized into the data ware-
corporation has no idea where it stands and where it will house in the Synthesis of Model phase. These models
go (Ferdinandi, 1999, p. xi). With data warehousing, are then used to exploit the data warehouse in the
todays corporations can collect and house large vol- Exploit phase. The data warehouse is updated with new
umes of data. Does the size of data volume simply data using appropriate updating strategies and linked to
guarantee you a success in your business? Does it mean various data sources.
that the more data you have the more strategic advan- Inmon (2002) sees system development for data
tages you have over your competitors? Not necessarily. warehouse environment as almost exactly the opposite
There is no predetermined formula that can turn your of the traditional SDLC. He thinks that traditional SDLC
information into competitive advantages (Inmon, is concerned with and supports primarily the opera-
Terdeman, & Imhoff, 2000). Thus, top management and tional environment. The data warehouse operates under
data administration team are confronted with the ques- a very different life cycle called CLDS (the reverse of
tion of how to convert corporate information into com- the SDLC). The CLDS is a classic data-driven develop-
petitive advantages. ment life cycle, but the SDLC is a classic requirements-
A well-managed data warehouse can assist a corpora- driven development life cycle.
tion in its strategy to gain competitive advantages. This
can be achieved by using an exploration warehouse, The Data Warehouse Team
which is a direct product of data warehouse, to identify
environmental factors, formulate strategic plans, and Building a data warehouse is a large system develop-
determine business specific objectives: ment process. Participants of data warehouse develop-
ment can range from a data warehouse administrator
Identifying Environmental Factors: Quantified (DWA) (Hoffer, Prescott, & McFadden, 2005) to a
analysis can be used for identifying a corporations business analyst (Ferdinandi, 1999). The data ware-
products and services, market share of specific house team is supposed to lead the organization into
products and services, financial management. assuming their roles and thereby bringing about a part-

18

TEAM LinG
Administering and Managing a Data Warehouse

partnership with the business (McKnight, 2000). A data warehouse team may have the following roles (Barker, 1998; Ferdinandi, 1999; Inmon, 2000, 2003; McKnight, 2000):

Data Warehouse Administrator (DWA): responsible for integrating and coordinating metadata and data across many different data sources, as well as for data source management, physical database design, operation, backup and recovery, security, and performance and tuning.

Manager/Director: responsible for the overall management of the entire team, ensuring that the team follows the guiding principles, business requirements, and corporate strategic plans.

Project Manager: responsible for data warehouse project development, including matching each team member's skills and aspirations to tasks on the project plan.

Executive Sponsor: responsible for garnering and retaining adequate resources for the construction and maintenance of the data warehouse.

Business Analyst: responsible for determining what information is required from a data warehouse to manage the business competitively.

System Architect: responsible for developing and implementing the overall technical architecture of the data warehouse, from the back-end hardware and software to the client desktop configurations.

ETL Specialist: responsible for routine work on data extraction, transformation, and loading for the warehouse databases.

Front-End Developer: responsible for developing the front end, whether it is client-server or over the Web.

OLAP Specialist: responsible for the development of data cubes, a multidimensional view of data in OLAP.

Data Modeler: responsible for modeling the existing data in an organization into a schema that is appropriate for OLAP analysis.

Trainer: responsible for training the end users to use the system so that they can benefit from the data warehouse system.

End User: responsible for providing feedback to the data warehouse team.

In terms of the size of the data warehouse administration team, Inmon (2003) has several recommendations (a rough illustration follows the list):

a large warehouse requires more analysts;
every 100 gigabytes of data in a data warehouse requires another data warehouse administrator;
a new data warehouse administrator is required for each year a data warehouse is up and running and is being used successfully;
if the ETL code is written manually, many data warehouse administrators are needed; if an automation tool is used, far fewer staff are required;
an automated data warehouse database management system (DBMS) requires fewer data warehouse administrators; otherwise, more administrators are needed;
fewer supporting staff are required if the corporate information factory (CIF) architecture is followed closely; conversely, more staff are needed.
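As a rough illustration of how these guidelines combine (the figures are hypothetical, and Inmon's rules are heuristics rather than a formula): a 300-gigabyte warehouse that has been running successfully for two years would call for roughly three administrators for its data volume plus two more for its years of operation, and still more if its ETL code is written and maintained by hand.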
McKnight (2000) suggests that all the technical roles be performed full-time by dedicated personnel and that each responsible person receive specific data warehouse training.

Data warehousing is growing rapidly. As the scope and data storage size of the data warehouse change, the roles and size of a data warehouse team should be adjusted accordingly. In general, extremes should be avoided. Without sufficient professionals, the job may not be done satisfactorily; on the other hand, too many people will leave the team overstaffed.

Process Management

Developing a data warehouse has become a popular but exceedingly demanding and costly activity in information systems development and management. Data warehouse vendors are competing intensively for their customers because so much of their money and prestige is at stake. Consulting vendors have redirected their attention toward this rapidly expanding market segment. User companies face a serious question of which product they should buy. Sen and Jacob's (1998) advice is to first understand the process of data warehouse development before selecting the tools for its implementation. A data warehouse development process refers to the activities required to build a data warehouse (Barquin, 1997). Sen and Jacob (1998) and Ma, Chou, and Yen (2000) have identified some of these activities, which need to be managed during the data warehouse development cycle: initializing the project, establishing the technical environment, tool integration, determining scalability, developing an enterprise information architecture, designing the data warehouse database, data extraction/transformation, managing metadata, developing the end-user interface, managing the production environment, managing decision support tools and applications, and developing warehouse roll-out.
As mentioned before, data warehouse development is a large system development process. Process management is not required in every step of the development process. Devlin (1997) states that process management is required in the following areas: process scheduling, which consists of a network of tasks and decision points; process map definition, which defines and maintains the network of tasks and decision points that make up a process; task initiation, which supports initiating tasks on all of the hardware/software platforms in the entire data warehouse environment; and status information enquiry, which enquires about the status of components that are running on all platforms.

Security Management

In recent years, information technology (IT) security has become one of the hottest and most important topics facing both users and providers (Senn, 2005). The goal of database security is the protection of data from accidental or intentional threats to its integrity and access (Hoffer, Prescott, & McFadden, 2005). The same is true for a data warehouse. However, stronger security methods, in addition to common practices such as view-based control, integrity control, processing rights, and DBMS security, need to be used for the data warehouse because of the differences between a database and a data warehouse. One of the differences that demands a higher level of security for a data warehouse is the scope and detail level of the data it contains, such as financial transactions, personal medical records, and salary information. A method that can be used to protect data requiring a high level of security in a data warehouse is encryption and decryption.

Confidential and sensitive data can be stored in a separate set of tables to which only authorized users have access. These data can be encrypted while they are being written into the data warehouse. In this way, the data captured and stored in the data warehouse are secure and can only be accessed on an authorized basis. Three levels of security can be offered by using encryption and decryption. The first level is that only authorized users can have access to the data in the data warehouse. Each group of users, internal or external, ranging from executives to information consumers, should be granted different rights for security reasons; unauthorized users are totally prevented from seeing the data in the data warehouse. The second level is protection from unauthorized dumping and interpretation of data: without the right key, an unauthorized user will not be allowed to write anything into the tables, and the existing data in the tables cannot be decrypted. The third level is protection from unauthorized access during the transmission process. Even if unauthorized access occurs during transmission, there is no harm to the encrypted data unless the user has the decryption code (Ma, Chou, & Yen, 2000).
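As an illustration of the first security level, the following SQL sketch (ours; the table, column, and role names are hypothetical, and the encryption itself would be performed by the ETL process or a DBMS-specific facility, since standard SQL has no encryption statement) shows how sensitive data might be isolated in a separate table and exposed to ordinary users only through an aggregating view:

-- Sensitive values live in their own table; the load process is
-- assumed to store them in encrypted form.
CREATE TABLE emp_salary (
   emp_id  INTEGER PRIMARY KEY,
   dept_id INTEGER,
   salary  DECIMAL(12,2)
);

-- Information consumers see only aggregated, non-identifying figures.
CREATE VIEW avg_salary_by_dept AS
SELECT dept_id, AVG(salary) AS avg_salary
FROM emp_salary
GROUP BY dept_id;

GRANT SELECT ON avg_salary_by_dept TO info_consumer_role;
-- Only the executive group may read the detailed table.
GRANT SELECT ON emp_salary TO executive_role;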
FUTURE TRENDS

Data warehousing administration and management faces several challenges as data warehousing becomes a mature part of the infrastructure of organizations. More legislative work is necessary to protect individual privacy from abuse by government or commercial entities that have large volumes of data concerning those individuals. The protection also calls for tightened security through technology, as well as user efforts toward workable rules and regulations, while at the same time still granting a data warehouse the ability to process large datasets for meaningful analyses (Marakas, 2003).

Today's data warehouse is limited to the storage of structured data in the form of records, fields, and databases. Unstructured data, such as multimedia, maps, graphs, pictures, sound, and video files, are increasingly demanded in organizations. How to manage the storage and retrieval of unstructured data, and how to search for specific data items, pose a real challenge for data warehouse administration and management. Alternative storage, especially near-line storage, which is one of the two forms of alternative storage, is considered one of the best future solutions for managing the storage and retrieval of unstructured data in data warehouses (Marakas, 2003).

The past decade has seen a fast rise of the Internet and World Wide Web. Today, Web-enabled versions of all leading vendors' warehouse tools are becoming available (Moeller, 2001). This recent growth in Web use and advances in e-business applications have pushed the data warehouse from the back office, where it is accessed by only a few business analysts, to the front lines of the organization, where all employees and every customer can use it.

To accommodate this move to the front line of the organization, the data warehouse demands massive scalability for data volume as well as for performance. As the number and types of users increase rapidly, enterprise data volume is doubling in size every 9 to 12 months. Around-the-clock access to the data warehouse is becoming the norm. The data warehouse will require fast implementation, continuous scalability, and ease of management (Marakas, 2003).

Additionally, building distributed warehouses, which are normally called data marts, will be on the rise. Other technical advances in data warehousing will include an increasing ability to exploit parallel processing,
automated information delivery, greater support of object extensions, very large database support, and user-friendly Web-enabled analysis applications. These capabilities should make data warehouses of the future more powerful and easier to use, which will further increase the importance of data warehouse technology for business strategic decision making and competitive advantages (Ma, Chou, & Yen, 2000; Marakas, 2003; Pace University, 2004).

CONCLUSION

The data that organizations have captured and stored are considered organizational assets. Yet the data themselves cannot do anything until they are put to intelligent use. One way to accomplish this goal is to use data warehouse and data mining technology to transform corporate information into business competitive advantages.

What impacts data warehouses the most is Internet and Web technology. The Web browser will become the universal interface for corporations, allowing employees to browse their data warehouse worldwide on public and private networks and eliminating the need to replicate data across diverse geographic locations. Thus, strong data warehouse management sponsorship and an effective administration team may become crucial factors in providing an organization with the information service it needs.

REFERENCES

Adelman, S., & Relm, C. (2003, November 5). What are the various phases in implementing a data warehouse solution? DMReview. Retrieved from http://www.dmreview.com/article_sub.cfm?articleId=7660

Barker, R. (1998, February). Managing a data warehouse. Chertsey, UK: Veritas Software Corporation.

Barquin, F. (1997). Building, using, and managing the data warehouse. Upper Saddle River, NJ: Prentice Hall.

Devlin, B. (1997). Data warehouse: From architecture to implementation. Reading, MA: Addison-Wesley.

Ferdinandi, P.L. (1999). Data warehouse advice for managers. New York: AMACOM American Management Association.

Gillenson, M.L. (2005). Fundamentals of database management systems. New York: John Wiley & Sons Inc.

Hoffer, J.A., Prescott, M.B., & McFadden, F.R. (2005). Modern database management (7th ed.). Upper Saddle River, NJ: Prentice Hall.

Information Technology Toolbox. (2004). 2003 IToolbox spending survey. Retrieved from http://datawarehouse.ittoolbox.com/research/survey.asp

Inmon, W.H. (2000). Building the data warehouse: Getting started. Retrieved from http://www.billinmon.com/library/whiteprs/earlywp/ttbuild.pdf

Inmon, W.H. (2002). Building the data warehouse (3rd ed.). New York: John Wiley & Sons Inc.

Inmon, W.H. (2003). Data warehouse administration. Retrieved from http://www.billinmon.com/library/other/dwadmin.asp

Inmon, W.H., Terdeman, R.H., & Imhoff, C. (2000). Exploration warehousing. New York: John Wiley & Sons Inc.

Kroenke, D.M. (2004). Database processing: Fundamentals, design, and implementation (9th ed.). Upper Saddle River, NJ: Prentice Hall.

Ma, C., Chou, D.V., & Yen, D.C. (2000). Data warehousing, technology assessment and management. Industrial Management + Data Systems, 100(3), 125-137.

Marakas, G.M. (2003). Modern data warehousing, mining, and visualization: Core concepts. Upper Saddle River, NJ: Prentice Hall.

Martin, J. (1997, September). New tools for decision making. DM Review, 7, 80.

McKnight Associates, Inc. (2000). Effective data warehouse organizational roles and responsibilities. Sunnyvale, CA.

Moeller, R.A. (2001). Distributed data warehousing using web technology: How to build a more cost-effective and flexible warehouse. New York: AMACOM American Management Association.

Pace University. (2004). Emerging technology. Retrieved from http://webcomposer.pace.edu/ea10931w/Tappert/Assignment2.htm

Post, G.V. (2005). Database management systems: Designing & building business applications (3rd ed.). New York: McGraw-Hill/Irwin.

Rob, P., & Coronel, C. (2004). Database systems: Design, implementation, and management (6th ed.). Boston, MA: Course Technology.

Sen, A., & Jacob, V.S. (1998). Industrial strength data warehousing: Why process is so important and so often ignored. Communications of the ACM, 41(9), 29-31.

Senn, J.A. (2004). Information technology: Principles, practices, opportunities (3rd ed.). Upper Saddle River, NJ: Prentice Hall.
KEY TERMS

Alternative Storage: An array of storage media that consists of two forms of storage: near-line storage and/or secondary storage.

CLDS: The facetiously named system development life cycle for analytical, DSS systems; CLDS is so named because it is in fact the reverse of the classical SDLC.

Corporate Information Factory (CIF): A logical architecture whose purpose is to deliver business intelligence and business management capabilities driven by data provided from business operations.

Data Mart: A data warehouse that is limited in scope and facility, built for a restricted domain.

Database Management System (DBMS): A set of programs used to define, administer, and process the database and its applications.

Metadata: Data about data; data concerning the structure of data in a database, stored in the data dictionary.

Near-Line Storage: Siloed tape storage in which siloed cartridges of tape are archived, accessed, and managed robotically.

Online Analytical Processing (OLAP): Decision support system (DSS) tools that use multidimensional data analysis techniques to provide users with multidimensional views of their data.

System Development Life Cycle (SDLC): The methodology used by most organizations for developing large information systems.
Agent-Based Mining of User Profiles for E-Services

Pasquale De Meo
Università Mediterranea di Reggio Calabria, Italy

Giovanni Quattrone
Università Mediterranea di Reggio Calabria, Italy

Giorgio Terracina
Università della Calabria, Italy

Domenico Ursino
Università Mediterranea di Reggio Calabria, Italy
INTRODUCTION

An electronic service (e-service) can be defined as a collection of network-resident software programs that collaborate to support users in both accessing and selecting data and services of their interest present in a provider site. Examples of e-services are e-commerce, e-learning, and e-government applications. E-services are undoubtedly one of the engines presently supporting the Internet revolution (Hull, Benedikt, Christophides & Su, 2003). Indeed, nowadays a large number and a great variety of providers offer their services also or exclusively via the Internet.

BACKGROUND

In spite of their spectacular development and present relevance, e-services are yet to be considered a stable technology, and various improvements could be considered for them. Many of the present suggestions for bettering them are based on the concept of adaptivity (i.e., the capability to make them more flexible in such a way as to adapt their offers and behavior to the environment in which they are operating). In this context, systems capable of constructing, maintaining, and exploiting profiles of users accessing e-services appear capable of playing a key role in the future.

Both in the past and in the present, various e-service providers have exploited (usually rough) user profiles for proposing personalized offers. However, in most cases, the profile construction methodology they adopt presents some problems. Indeed, it often requires a user to spend a certain amount of time constructing and updating the profile; in addition, it stores only information about the proposals that the user claims to be interested in, without considering others somehow related to those just provided, possibly interesting to the user in the future, which the user did not take into account in the past.

In spite of present user profile managers, when accessing an e-service a user must generally search personally through it for the proposals of interest. As an example, consider the bookstore section of Amazon; whenever a customer looks for a book of interest, the customer must carry out an autonomous personal search for it throughout the pages of the site. We argue that, to improve the effectiveness of e-services, it is necessary on the one hand to increase the interaction between the provider and the user, and on the other hand to construct a rich profile of the user, taking into account the user's desires, interests, and behavior.

In addition, it is necessary to take into account a further important factor. Nowadays, electronic and telecommunications technology is rapidly evolving in such a way as to allow cell phones, palmtops, and wireless PDAs to navigate on the Web. These mobile devices do not have the same display or bandwidth capabilities as their desktop counterparts; nonetheless, present e-service providers deliver the same content to all device typologies (Communications of the ACM, 2002).

In the past, various approaches have been proposed for handling e-service activities; many of them are agent-based. For example:

In Terziyan and Vitko (2002), an agent-based framework for managing commercial transactions between a buyer and a seller is proposed. It exploits a user profile that is handled by means of a content-based policy.

In Garcia, Paternò, and Gil (2002), a multi-agent system called e-CoUSAL, capable of supporting Web-shop activities, is presented. Its activity is
based on the maintenance and the exploitation of user profiles.

In Lau, Hofstede, and Bruza (2000), WEBS, an agent-based approach for supporting e-commerce activities, is proposed. It exploits probabilistic logic rules to allow the customer's preferences for other products to be deduced.

Ardissono et al. (2001) describe SETA, a multi-agent system conceived for developing adaptive Web stores. SETA uses knowledge representation techniques to construct, maintain, and exploit user profiles.

In Bradley and Smyth (2003), the system CASPER, for handling recruitment services, is proposed. Given a user, CASPER first ranks job advertisements according to the applicant's desires and then recommends job proposals to the applicant on the basis of the applicant's past behavior.

In Razek, Frasson, and Kaltenbach (2002), a multi-agent prototype for e-learning called CITS (Confidence Intelligent Tutoring System) is proposed. The approach of CITS aims at being adaptive and dynamic.

In Shang, Shi, and Chen (2001), IDEAL (Intelligent Distributed Environment for Active Learning), a multi-agent system for active distance learning, is proposed. In IDEAL, course materials are decomposed into small components called lecturelets. These are XML documents containing Java code; they are dynamically assembled to cover course topics according to learner progress.

In Zaiane (2002), an approach exploiting Web-mining techniques to build a software agent supporting e-learning activities is presented.

All these systems construct, maintain, and exploit a user profile; therefore, we can consider them adaptive w.r.t. the user. However, to the best of our knowledge, none of them is adaptive w.r.t. the device.

On the other side, in various areas of computer science research, a large variety of approaches adapting their behavior to the device the user is exploiting have been proposed. As an example:

In Anderson, Domingos, and Weld (2001), a framework called MINPATH, capable of simplifying the browsing activity of a mobile user while taking into account the device the user is exploiting, is presented.

In Macskassy, Dayanik, and Hirsh (2000), a framework named i-Valets is proposed for allowing a user to visit an information source by using different devices.

Samaras and Panayiotou (2002) present a flexible agent-based system for providing wireless users with personalized access to Internet services.

In Araniti, De Meo, Iera, and Ursino (2003), a novel XML-based multi-agent system for QoS management in wireless networks is presented.

These approaches are particularly general and interesting; however, to the best of our knowledge, none of them has been conceived for handling e-services.

MAIN THRUST

Challenges to Face

In order to overcome the problems outlined previously, some challenges must be tackled.

First, a user can access many e-services, operating in the same or in different application contexts; a faithful and complete profile of the user can be constructed only by taking into account the user's behavior while accessing all the sites. In other words, it should be possible to construct a unique structure on the user side, storing the user's profile and, therefore, representing the user's behavior while accessing all the sites.

Second, for a given user and e-service provider, it should be possible to compare the profile of the user with the offers of the provider, to extract those proposals that will probably interest the user. Existing techniques for satisfying such a requirement are based mainly on the exploitation of either log files or cookies. Techniques based on log files can register only some information about the actions carried out by the user upon accessing an e-service; they cannot match user preferences and e-service proposals. Vice versa, techniques based on cookies are able to carry out a certain, even if primitive, match; however, they need to know and exploit some personal information that a user might consider private.

Third, it is necessary to overcome the typical one-size-fits-all philosophy of present e-service providers by developing systems capable of adapting their behavior both to the profile of the user and to the characteristics of the device the user is exploiting to access them (Communications of the ACM, 2002).

System Description

The system we present in this article (called e-service adaptive manager [ESA-Manager]) aims at solving all
three problems mentioned previously. It is an XML-based multi-agent system for handling user accesses to e-services, capable of adapting its behavior to both user and device profiles.

In ESA-Manager, a service provider agent is present for each e-service provider, handling the proposals stored therein as well as the interaction with the user. In addition, an agent is associated with each user, adapting its behavior to the profiles of both the user and the device the user is exploiting to visit the sites. Actually, since a user can access e-service providers by means of different devices, the user's profile cannot be stored in only one of them; as a matter of fact, it is necessary to have a unique copy of the user profile that registers the user's behavior in visiting the e-service providers during the various sessions, possibly carried out by means of different devices. For this reason, the profile of a user must be handled and stored on a support different from the devices generally exploited by the user for accessing e-service providers. As a consequence, on the user side it is necessary to have both a profile agent, storing the profiles of the involved users and devices, and a user-device agent, associated with a specific user operating by means of a specific device, supporting the user in his or her activities.

As previously pointed out, for each user a unique profile is mined and maintained, storing information about the user's behavior in accessing all e-service providers.¹ The techniques for mining, maintaining, and exploiting user profiles are quite complex and differ slightly across application domains; the interested reader can find examples of them, along with the corresponding validation issues, in De Meo, Rosaci, Sarnè, Terracina, and Ursino (2003) for e-commerce and in De Meo, Garro, Terracina, and Ursino (2003) for e-learning. In this way, ESA-Manager solves the first problem mentioned previously.

Whenever a user accesses an e-service by means of a certain device, the corresponding service provider agent sends information about its proposals to the user-device agent associated with that user and the device he or she is exploiting. The user-device agent determines similarities between the proposals presented by the provider and the interests of the user. For each of these similarities, the service provider agent and the user-device agent cooperate in presenting to the user a group of Web pages, adapted to the exploited device, illustrating the proposal.

We argue that this behavior provides ESA-Manager with the capability of supporting the user in the search for proposals of interest offered by the provider. In addition, the algorithms underlying ESA-Manager allow it to identify not only the proposals probably interesting to the user in the present, but also others possibly interesting to the user in the future, which the user disregarded in the past (see De Meo, Rosaci, Sarnè, Terracina & Ursino [2003] for a specialization of these algorithms to e-commerce). In our opinion, this is a particularly interesting feature for a novel approach devoted to dealing with e-services.

Last but not least, it is worth observing that since user profile management is carried out on the user side, no information about the user profile is sent to the e-service providers. In this way, ESA-Manager solves the privacy problems left open by cookies.

All the reasonings presented show that ESA-Manager is capable of solving the second problem mentioned previously as well.

In ESA-Manager, the device profile plays a central role. Indeed, the proposals of a provider shown to a user, as well as their presentation formats, depend on the characteristics of the device the user is presently exploiting. However, the ESA-Manager capability of adapting its behavior to the device the user is exploiting is not restricted to the presentation format of the proposals; indeed, the exploited device can also influence the computation of the interest degree shown by a user in the proposals presented by each provider.

More specifically, one of the parameters on which the interest degree associated with a proposal is based is the time the user spends visiting the corresponding Web pages. This time is not to be considered an absolute measure; it must be normalized w.r.t. both the characteristics of the exploited device and the navigation costs (Chan, 2000). The following example clarifies this intuition. Assume that a user visits a Web page twice and that each visit takes n seconds. Suppose, also, that during the first access the user exploits a mobile phone having a low processor clock and supporting a connection characterized by low bandwidth and high cost. During the second visit, the user uses a personal computer having a high processor clock and supporting a connection characterized by high bandwidth and low cost. It can be argued that the interest the user exhibited in the page in the former access is greater than in the latter. Other device parameters also influence the behavior of ESA-Manager (see De Meo, Rosaci, Sarnè, Terracina & Ursino [2003] for a detailed specification of the role of these parameters).
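To make the intuition concrete, the normalization could take a form such as the following (an illustrative formula of ours; the actual parameters and weights used by ESA-Manager are specified in the cited papers):

   interest(p, d) = visit_time(p, d) / expected_time(p, d)

where expected_time(p, d) is an estimate of the time needed just to download and render page p on device d, growing with the page size and the connection cost and shrinking with the bandwidth and processor clock of d. Under such a normalization, the same absolute visit time yields a higher interest degree on a slow, expensive device than on a fast, cheap one, as in the example above.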
This reasoning allows us to argue that ESA-Manager solves the third problem mentioned previously as well.

As already pointed out, many agents are simultaneously active in ESA-Manager; they interact strongly with each other and continuously exchange information. In this scenario, efficient management of information exchange is crucial. One of the most promising solutions to this problem has been the adoption of XML.
XML's capabilities make it particularly suited to exploitation in agent research. In ESA-Manager, the role of XML is central; indeed, (1) the agent ontologies are stored as XML documents; (2) the agent communication language is ACML; (3) the extraction of information from the various data structures is carried out by means of XQuery; and (4) the manipulation of agent ontologies is performed by means of the Document Object Model (DOM).

FUTURE TRENDS

The spectacular growth of the Internet during the last decade has strongly conditioned the e-service landscape. Such growth is particularly striking in some application domains, such as financial services and e-government.

As an example, Internet technology has enabled the expansion of financial services by integrating the already existing, quite variegated financial data and services and by providing new channels for information delivery. For instance, in 2004 the number of households in the U.S. that will use online banking is expected to exceed approximately 24 million, nearly double the number of households at the end of 2000.

Moreover, e-services are not a leading paradigm only in business contexts; they are an emerging standard in several application domains. As an example, they are applied vigorously by governmental units at national, regional, and local levels around the world. Moreover, e-service technology is currently successfully exploited in some metropolitan networks to provide mediation tools in a democratic system, in order to make citizen participation in rule- and decision-making processes more feasible and direct. These are only two examples of the role e-services can play in the e-government context. Handling and managing this technology in all these environments is one of the most challenging issues for present and future researchers.

CONCLUSION

In this article, we have proposed ESA-Manager, an XML-based and adaptive multi-agent system for supporting a user accessing an e-service provider in the search for proposals present therein that appear appealing according to the user's past interests and behavior.

We have shown that ESA-Manager is adaptive w.r.t. the profiles of both the user and the device the user is exploiting to access the e-service provider. Finally, we have seen that it is XML-based, since XML is exploited both for storing the agent ontologies and for handling the agent communication.

As for future work, we argue that various improvements could be performed on ESA-Manager to better its effectiveness and completeness. As an example, it might be interesting to categorize the involved users on the basis of their profiles, as well as the involved providers on the basis of their proposals. As a further example of profitable features with which our system could be enriched, we consider extremely promising the derivation of association rules representing and predicting user behavior in accessing one or more providers. Finally, ESA-Manager could be made even more adaptive by considering the possibility of adapting its behavior not only to the device a user is exploiting during a certain access, but also to the context (e.g., job, holidays) in which the user is currently operating.

REFERENCES

Adaptive Web. (2002). Communications of the ACM, 45(5).

Anderson, C.R., Domingos, P., & Weld, D.S. (2001). Adaptive Web navigation for wireless devices. Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI 2001), Seattle, Washington.

Araniti, G., De Meo, P., Iera, A., & Ursino, D. (2003). Adaptively controlling the QoS of multimedia wireless applications through user-profiling techniques. Journal of Selected Areas in Communications, 21(10), 1546-1556.

Ardissono, L. et al. (2001). Agent technologies for the development of adaptive Web stores. Agent Mediated Electronic Commerce, The European AgentLink Perspective (pp. 194-213). Lecture Notes in Computer Science, Springer.

Bradley, K., & Smyth, B. (2003). Personalized information ordering: A case study in online recruitment. Knowledge-Based Systems, 16(5-6), 269-275.

Chan, P.K. (2000). Constructing Web user profiles: A non-invasive learning approach. Web Usage Analysis and User Profiling (pp. 39-55). Springer.

De Meo, P., Garro, A., Terracina, G., & Ursino, D. (2003). X-Learn: An XML-based, multi-agent system for supporting user-device adaptive e-learning. Proceedings of the International Conference on Ontologies, Databases and Applications of Semantics (ODBASE 2003), Taormina, Italy.
De Meo, P., Rosaci, D., Sarnè, G.M.L., Terracina, G., & Ursino, D. (2003). An XML-based adaptive multi-agent system for handling e-commerce activities. Proceedings of the International Conference on Web Services-Europe (ICWS-Europe '03), Erfurt, Germany.

Garcia, F.J., Paternò, F., & Gil, A.B. (2002). An adaptive e-commerce system definition. Proceedings of the International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems (AH'02), Malaga, Spain.

Hull, R., Benedikt, M., Christophides, V., & Su, J. (2003). E-services: A look behind the curtain. Proceedings of the Symposium on Principles of Database Systems (PODS 2003), San Diego, California.

Lau, R., Hofstede, A., & Bruza, P. (2000). Adaptive profiling agents for electronic commerce. Proceedings of the CollECTeR Conference on Electronic Commerce (CollECTeR 2000), Breckenridge, Colorado.

Macskassy, S.A., Dayanik, A.A., & Hirsh, H. (2000). Information valets for intelligent information access. Proceedings of the AAAI Spring Symposia Series on Adaptive User Interfaces (AUI-2000), Stanford, California.

Razek, M.A., Frasson, C., & Kaltenbach, M. (2002). Toward more effective intelligent distance learning environments. Proceedings of the International Conference on Machine Learning and Applications (ICMLA'02), Las Vegas, Nevada.

Samaras, G., & Panayiotou, C. (2002). Personalized portals for the wireless user based on mobile agents. Proceedings of the International Workshop on Mobile Commerce, Atlanta, Georgia.

Shang, Y., Shi, H., & Chen, S. (2001). An intelligent distributed environment for active learning. Proceedings of the ACM International Conference on World Wide Web (WWW 2001), Hong Kong.

Terziyan, V., & Vitko, O. (2002). Intelligent information management in mobile electronic commerce. Artificial Intelligence News, Journal of Russian Association of Artificial Intelligence, 5.

Zaiane, O.R. (2002). Building a recommender agent for e-learning systems. Proceedings of the International Conference on Computers in Education (ICCE 2002), Auckland, New Zealand.

KEY TERMS

ACML: The XML encoding of the Agent Communication Language defined by the Foundation for Intelligent Physical Agents (FIPA).

Adaptive System: A system adapting its behavior on the basis of the environment it is operating in.

Agent: A computational entity capable of both perceiving dynamic changes in the environment it is operating in and autonomously performing user-delegated tasks, possibly by communicating and cooperating with other similar entities.

Agent Ontology: A description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents.

Device Profile: A model of a device storing information about both its costs and its capabilities.

E-Service: A collection of network-resident software programs that collaborate to support users in both accessing and selecting data and services of their interest handled by a provider site. Examples of e-services are e-commerce, e-learning, and e-government applications.

eXtensible Markup Language (XML): The language, standardized by the World Wide Web Consortium, for representing, handling, and exchanging information on the Web.

Multi-Agent System (MAS): A loosely coupled network of software agents that interact to solve problems that are beyond the individual capacities or knowledge of each of them. An MAS distributes computational resources and capabilities across a network of interconnected agents. The agent cooperation is handled by means of an Agent Communication Language.

User Modeling: The process of gathering information specific to each user, either explicitly or implicitly. This information is exploited in order to customize the content and the structure of a service to the user's specific and individual needs.

User Profile: A model of a user representing both the user's preferences and the user's behavior.

ENDNOTE

1   It is worth pointing out that providers could be either homogeneous (i.e., all of them operate in the same application context, such as e-commerce) or heterogeneous (i.e., they operate in different application contexts).
Aggregate Query Rewriting in Multidimensional Databases

Leonardo Tininini
CNR - Istituto di Analisi dei Sistemi e Informatica "Antonio Ruberti", Italy
INTRODUCTION

An efficient query engine is certainly one of the most important components of a data warehouse (also known as an OLAP system or multidimensional database), and its efficiency is influenced by many other aspects, both logical (data model, policy of view materialization, etc.) and physical (multidimensional or relational storage, indexes, etc.). As is evident, OLAP queries are often based on the usual metaphor of the data cube and the concepts of facts, measures and dimensions and, in contrast to conventional transactional environments, they require the classification and aggregation of enormous quantities of data. In spite of that, one of the fundamental requirements for these systems is the ability to perform multidimensional analyses in online response times. Since the evaluation from scratch of a typical OLAP aggregate query may require several hours of computation, this can only be achieved by pre-computing several queries, storing the answers permanently in the database and then reusing them in the query evaluation process. These pre-computed queries are commonly referred to as materialized views, and the problem of evaluating a query by using (possibly only) these precomputed results is known as the problem of answering/rewriting queries using views. In this paper we briefly analyze the difference between the query answering and query rewriting approaches and why query rewriting is preferable in a data warehouse context. We also discuss the main techniques proposed in the literature to rewrite aggregate multidimensional queries using materialized views.

BACKGROUND

Multidimensional data are obtained by applying aggregations and statistical functions to elementary data, or more precisely to data groups, each containing a subset of the data and homogeneous with respect to a given set of attributes. For example, the data "Average duration of calls in 2003 by region and call plan" is obtained from the so-called fact table, which is usually the product of complex source integration activities (Lenzerini, 2002) on the raw data corresponding to each phone call in that year. Several groups are defined, each consisting of the calls made in the same region and with the same call plan, and the average aggregation function is finally applied to the duration attribute of the data in each group. The pair of values (region, call plan) is used to identify each group and is associated with the corresponding average duration value. In multidimensional databases, the attributes used to group data define the dimensions, whereas the aggregate values define the measures.
on the usual metaphor of the data cube and the concepts The term multidimensional data comes from the well-
of facts, measures and dimensions and, in contrast to known metaphor of the data cube (Gray, Bosworth, Lay-
conventional transactional environments, they require man, & Pirahesh, 1996). For each of n attributes, used to
the classification and aggregation of enormous quantities identify a single measure, a dimension of an n-dimensional
of data. In spite of that, one of the fundamental require- space is considered. The possible values of the identify-
ments for these systems is the ability to perform multidi- ing attributes are mapped to points on the dimensions
mensional analyses in online response times. Since the axis, and each point of this n-dimensional space is thus
evaluation from scratch of a typical OLAP aggregate mapped to a single combination of the identifying at-
query may require several hours of computation, this can tribute values and hence to a single aggregate value. The
only be achieved by pre-computing several queries, stor- collection of all these points, along with all possible
ing the answers permanently in the database and then projections in lower dimensional spaces, constitutes the
reusing them in the query evaluation process. These pre- so-called data cube. In most cases, dimensions are struc-
computed queries are commonly referred to as material- tured in hierarchies, representing several granularity lev-
ized views and the problem of evaluating a query by using els of the corresponding measures (Jagadish, Lakshmanan,
(possibly only) these precomputed results is known as & Srivastava, 1999). Hence a time dimension can be
the problem of answering/rewriting queries using views. organized into days, months and years; a territorial dimen-
In this paper we briefly analyze the difference between sion into towns, regions and countries; a product dimen-
query answering and query rewriting approach and why sion into brands, families and types. When querying
query rewriting is preferable in a data warehouse context. multidimensional data, the user specifies the measures of
We also discuss the main techniques proposed in litera- interest and the level of detail required by indicating the
ture to rewrite aggregate multidimensional queries using desired hierarchy level for each dimension. In a multidi-
materialized views. mensional environment querying is often an exploratory
process, where the user moves along the dimension
hierarchies by increasing or reducing the granularity of
BACKGROUND displayed data. The drill-down operation corresponds to
an increase in detail, for example, by requesting the
Multidimensional data are obtained by applying aggrega- number of calls by region and month, starting from data
tions and statistical functions to elementary data, or more on the number of calls by region or by region and year.
precisely to data groups, each containing a subset of the Conversely, roll-up allows the user to view data at a
data and homogeneous with respect to a given set of coarser level of granularity (Agrawal, Gupta, & Sarawagi,
attributes. For example, the data Average duration of 1997; Cabibbo & Torlone, 1997).
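For example, assuming a materialized view Calls_by_region_month(region, month, num_calls) storing the number of calls by region and month (hypothetical names of ours), the roll-up to the number of calls by region can be computed from the finer-grained view instead of the fact table, since counts can be summed across months:

SELECT region, SUM(num_calls)
FROM Calls_by_region_month
GROUP BY region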
Multidimensional querying systems are commonly known as OLAP (Online Analytical Processing) Systems, in contrast to conventional OLTP (Online Transactional Processing) Systems. The two types have several
contrasting features, although they share the same requirement of fast online response times. In particular, one of the key differences between OLTP and OLAP queries is the number of records required to calculate the answer. OLTP queries typically involve a rather limited number of records, accessed through primary key or other specific indexes, which need to be processed in short, isolated transactions or issued on a user interface. In contrast, multidimensional queries usually require the classification and aggregation of a huge amount of data (Gupta, Harinarayan, & Quass, 1995), and fast response times are made possible by the extensive use of pre-computed queries, called materialized views (whose answers are stored permanently in the database), and by sophisticated techniques enabling the query engine to exploit these pre-computed results.

MAIN THRUST

The problem of evaluating the answer to a query by using pre-computed (materialized) views has been extensively studied in the literature and generically denoted as answering queries using views (Levy, Mendelzon, Sagiv, & Srivastava, 1995; Halevy, 2001). The problem can be informally stated as follows: given a query Q and a collection of views V over the same schema s, is it possible to evaluate the answer to Q by using (only) the information provided by V? A more rigorous distinction has also been made between view-based query rewriting and query answering, corresponding to two distinct approaches to the general problem (Calvanese, De Giacomo, Lenzerini, & Vardi, 2000; Halevy, 2001). This is strictly related to the distinction between view definition and view extension, which is analogous to the standard distinction between schema and instance in the database literature. Broadly speaking, the view definition corresponds to the way the query is syntactically defined, for example, to the corresponding SQL expression, while its extension corresponds to the set of returned tuples, that is, the result obtained by evaluating the view on a specific database instance.

Query Answering vs. Query Rewriting

Query rewriting is based on the use of view definitions to produce a new rewritten query, expressed in terms of the available view names and equivalent to the original. The answer can then be obtained by using the rewritten query and the view extensions (instances). Query answering, in contrast, is based on the exploitation of both view definitions and extensions and attempts to determine the best possible answer, possibly a subset of the exact answer, which can be extracted from the view extensions (Abiteboul & Duschka, 1998; Grahne & Mendelzon, 1999).

In general, query answering techniques are preferable in contexts where exact answers are unlikely to be obtained (e.g., integration of heterogeneous data sources, like Web sites) and response time requirements are not very stringent. However, as noted in Grahne & Mendelzon (1999), query answering methods can be extremely inefficient, as it is difficult or even impossible to process only the useful views and apply optimization techniques such as pushing selections and joins. As a consequence, the rewriting approach is more appropriate in contexts such as OLAP systems, where there is a very large amount of data and fast response times are required (Goldstein & Larson, 2001), and for query optimization, where different query plans need to be maintained in main memory and efficiently compared (Afrati, Li, & Ullman, 2001).

Rewriting and Answering: An Example

Consider a fact table Cens of elementary census data on the simplified schema (Census_tract_ID, Sex, Empl_status, Educ_status, Marital_status), and a collection of aggregate data representing the resident population by sex and marital status, stored in a materialized view on the schema V: (Sex, Marital_status, Pop_res). For simplicity, it is assumed that the dimensional tables are collapsed into the fact table Cens. A typical multidimensional query will be shown in the next section. The view V is computed by a simple count(*)-group-by query on the table Cens.

CREATE VIEW V AS
SELECT Sex, Marital_status, COUNT(*) AS Pop_res
FROM Cens
GROUP BY Sex, Marital_status

The query Q expressed by

SELECT Marital_status, COUNT(*)
FROM Cens
GROUP BY Marital_status

corresponding to the resident population by marital status can be computed without accessing the data in Cens, and be rewritten as follows:

SELECT Marital_status, SUM(Pop_res)
FROM V
GROUP BY Marital_status

Note that the rewritten query can be obtained very efficiently by simple syntactic manipulations on Q and V, and its applicability does not depend on the records in V. Suppose now some subsets of (views on) Cens are available, corresponding to the employment statuses
'students', 'employed' and 'retired', called V_ST, V_EMP and V_RET respectively. For example, V_RET may be defined by:

CREATE VIEW V_RET AS
SELECT *
FROM Cens
WHERE Empl_status = 'retired'

It is evident that no rewriting can be obtained by using only the specified views, both because some individuals are not present in any of the views (e.g., young children, unemployed, housewives, etc.) and because some may be present in two views (a student may also be employed). However, a query answering technique tries to collect each useful accessible record and build the "best possible" answer, possibly by introducing approximations. By using the information on the census tract and a matching algorithm, most overlapping records may be determined and an estimate (lower bound) of the result obtained by summing the non-replicated contributions from the views. Obviously, this would require considerable computation time, but it might be able to produce an approximate answer in a situation where rewriting techniques would produce no answer at all.
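One possible way to assemble such an approximate answer (a sketch of ours, not a general query answering algorithm) is to aggregate over the duplicate-eliminating union of the three views. Since each view contains entire rows of Cens, a row describing the same individual in two views collapses into one; the result undercounts whenever two distinct individuals share all the listed attributes, which is consistent with its role as a lower bound:

SELECT Marital_status, COUNT(*)
FROM (SELECT * FROM V_ST
      UNION
      SELECT * FROM V_EMP
      UNION
      SELECT * FROM V_RET) AS U
GROUP BY Marital_status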
Rewriting Aggregate Queries

A typical elementary multidimensional query is described by the join of the fact table with two or more dimension tables, to which an aggregate group-by query is applied (see the example query Q1 below). As a consequence, the rewriting of this form of query and view has been studied by many researchers.

SELECT D1.dim1, D2.dim2, AGG(F.measure)
FROM fact_table F, dim_table1 D1, dim_table2 D2
WHERE F.dimKey1 = D1.dimKey1
AND F.dimKey2 = D2.dimKey2
GROUP BY D1.dim1, D2.dim2      (Q1)

In Gupta, Harinarayan, & Quass (1995), an algorithm is proposed to rewrite conjunctive queries with aggregations using views of the same form. The technique is based on the concept of generalized projection (GP) and some transformation rules usable by an optimizer, which enable the query and views to be put in a particular normal form based on GPSJ (Generalized Projection/Selection/Join) expressions. The query and views are analyzed in terms of their query tree, that is, the tree representing how to calculate them by applying selections, joins and generalized projections on the base relations. By using the transformation rules, the algorithm tries to produce a match between one or more view trees and subtrees (and consequently to replace the calculations with accesses to the corresponding materialized views). The results are extended to NGPSJ (Nested GPSJ) expressions in Golfarelli & Rizzi (2000).

In Srivastava, Dar, Jagadish, & Levy (1996) an algorithm is proposed to rewrite a single-block (conjunctive) SQL query with GROUP BY and aggregations using various views of the same form. The aggregate functions considered are MIN, MAX, COUNT and SUM. The algorithm is based on the detection of homomorphisms from view to query, as in the non-aggregate context (Levy, Mendelzon, Sagiv, & Srivastava, 1995). However, it is shown that more restrictive conditions must be considered when dealing with aggregates, as the view has to produce not only the right tuples, but also their correct multiplicities.
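A simple related illustration of why multiplicities matter (our example, not taken from the cited papers) is the rewriting of an average at a coarser granularity: a view storing only AVG(duration) by region and call plan cannot be rolled up to the average by region, because the average of averages weights every group equally regardless of its size. A view storing SUM and COUNT preserves the multiplicities and makes the rewriting possible:

CREATE VIEW V_dur AS
SELECT region, call_plan, SUM(duration) AS sum_dur, COUNT(*) AS cnt
FROM Calls
GROUP BY region, call_plan

-- weighted average by region: total duration over total number of calls
SELECT region, SUM(sum_dur) / SUM(cnt)
FROM V_dur
GROUP BY region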
In Cohen, Nutt, & Serebrenik (1999, 2000) a somewhat different approach is proposed: the original query, the usable views and the rewritten query are all expressed in an extension of Datalog with aggregate functions (again COUNT, SUM, MIN and MAX) as the query language. Queries and views are assumed to be conjunctive. Several candidates for rewriting, of particular forms, are considered, and for each candidate the views in its body are unfolded (i.e., replaced by their body in the view definition). Finally, the unfolded candidate is compared with the original query to verify equivalence by using known equivalence criteria for aggregate queries, particularly those proposed in Nutt, Sagiv, & Shurin (1998) for COUNT, SUM, MIN and MAX queries. The technique can be extended by using the equivalence criteria for AVG queries presented in Grumbach, Rafanelli, & Tininini (1999), based on the syntactic notion of isomorphism modulo a product.

In query rewriting it is important to identify the views that may actually be useful in the rewriting process: this is often referred to as the view usability problem. In the non-aggregate context, it is shown (Levy, Mendelzon, Sagiv, & Srivastava, 1995) that a conjunctive view can be used to produce a conjunctive rewritten query if a homomorphism exists from the body of the view to that of the query. Grumbach, Rafanelli, & Tininini (1999) demonstrate that more restrictive (necessary and sufficient) conditions, based on the concept of sound homomorphisms, are needed for the usability of conjunctive count views in the rewriting of conjunctive count queries. It is also shown that in the presence of aggregations it is not sufficient to consider only rewritten queries of conjunctive form: more complex forms may be required, particularly those based on the concept of isomorphism modulo a product.

All rewriting algorithms proposed in the literature are based on trying to obtain a rewritten query of a particular form by using (possibly only) the available views.
An interesting question is: "Can I rewrite more by considering rewritten queries of more complex form?", and an even more ambitious one: "Given a collection of views, is the information they provide sufficient to rewrite a query?" In Grumbach & Tininini (2003) the problem is investigated in a general framework based on the concept of query subsumption. Basically, the information content of a query is characterized by its distinguishing power, that is, by its ability to determine that two database instances are different. Hence a collection of views subsumes a query if it is able to distinguish any pair of instances also distinguishable by the query, and it is shown that a query rewriting using various views exists if the views subsume the query. In the particular case of count and sum queries defined over the same fact table, an algorithm is proposed which is demonstrated to be complete. In other words, even if the algorithm (as with any algorithm of practical use) considers rewritten queries of particular forms, it is shown that no improvement could be obtained by considering rewritten queries of more complex forms.

Finally, in Grumbach & Tininini (2000) a completely different approach to the problem of aggregate rewriting is proposed. The technique is based on the idea of formally expressing the relationships (metadata) between raw and aggregate data, and also among aggregate data of different types and/or levels of detail. Data are stored in standard relations, while the metadata are represented by numerical dependencies, namely Horn clauses formally expressing the semantics of the aggregate attributes. The mechanism is tested by transforming the numerical dependencies into Prolog rules and then exploiting the Prolog inference engine to produce the rewriting.

FUTURE TRENDS

Although query rewriting techniques are currently considered preferable to query answering in OLAP systems, the ever increasing processing capabilities of modern computers may change the relevance of query answering techniques in the near future. Meanwhile, the limitations in the applicability of several rewriting algorithms show that a substantial effort is still needed, and important contributions may stem from results in other research areas like logic programming and automated reasoning. In particular, aggregate query rewriting is strictly related to the problem of query equivalence for aggregate queries, and current equivalence criteria only apply to rather simple forms of query; they don't consider, for example, the combination of conjunctive formulas with nested aggregations.

Also, the results on view usability and query subsumption can be considered only preliminary, and it would be interesting to study the completeness properties of known rewriting algorithms and to provide necessary and sufficient conditions for the usability of a view to rewrite a query, even when both the query and the view are aggregate and of non-trivial form (e.g., allowing disjunction and some limited form of negation).

CONCLUSION

This paper has discussed a fundamental issue related to multidimensional query evaluation, that is, how a multidimensional query expressed in a given language can be translated, using some available materialized views, into an (efficient) evaluation plan which retrieves the necessary information and calculates the required results. We have analyzed the difference between the query answering and query rewriting approaches and discussed the main techniques proposed in the literature to rewrite aggregate multidimensional queries using materialized views.

REFERENCES

Abiteboul, S., & Duschka, O.M. (1998). Complexity of answering queries using materialized views. In ACM Symposium on Principles of Database Systems (PODS'98) (pp. 254-263).

Afrati, F.N., Li, C., & Ullman, J.D. (2001). Generating efficient plans for queries using views. In ACM International Conference on Management of Data (SIGMOD'01) (pp. 319-330).

Agrawal, R., Gupta, A., & Sarawagi, S. (1997). Modeling multidimensional databases. In International Conference on Data Engineering (ICDE'97) (pp. 232-243).

Cabibbo, L., & Torlone, R. (1997). Querying multidimensional databases. In International Workshop on Database Programming Languages (DBPL'97) (pp. 319-335).

Calvanese, D., De Giacomo, G., Lenzerini, M., & Vardi, M.Y. (2000). What is view-based query rewriting? In International Workshop on Knowledge Representation meets Databases (KRDB'00) (pp. 17-27).

Cohen, S., Nutt, W., & Serebrenik, A. (1999). Rewriting aggregate queries using views. In ACM Symposium on Principles of Database Systems (PODS'99) (pp. 155-166).

Cohen, S., Nutt, W., & Serebrenik, A. (2000). Algorithms for rewriting aggregate queries using views. In ABDIS-DASFAA Conference 2000 (pp. 65-78).

Goldstein, J., & Larson, P. (2001). Optimizing queries using materialized views: A practical, scalable solution. In ACM International Conference on Management of Data (SIGMOD'01) (pp. 331-342).
ACM International Conference on Management of Data Principles of Database Systems (PODS98) (pp. 214-
(SIGMOD01) (pp. 331-342). 223).
Golfarelli, M., & Rizzi, S. (2000). Comparing nested GPSJ Srivastava, D., Dar, S., Jagadish, H.V., & Levy, A.Y.
queries in multidimensional databases. In Workshop on (1996). Answering queries with aggregation using views.
Data Warehousing and OLAP (DOLAP 2000) (pp. 65-71). In International Conference on Very Large Data Bases
(VLDB96) (pp. 318-329).
Grahne, G., & Mendelzon, A.O. (1999). Tableau tech-
niques for querying information sources through global
schemas. In International Conference on Database KEY TERMS
Theory (ICDT99) (pp. 332-347).
Gray, J., Bosworth, A., Layman, A., & Pirahesh, H. (1996). Data Cube: A collection of aggregate values classified
Data cube: A relational aggregation operator generalizing according to several properties of interest (dimensions).
group-by, cross-tab, and sub-total. In International Con- Combinations of dimension values are used to identify the
ference on Data Engineering (ICDE96) (pp. 152-159). single aggregate values in the cube.

Grumbach, S., Rafanelli, M., & Tininini, L. (1999). Query- Dimension: A property of the data used to classify it
ing aggregate data. In ACM Symposium on Principles of and navigate the corresponding data cube. In multidimen-
Database Systems (PODS99) (pp. 174-184). sional databases dimensions are often organized into
several hierarchical levels, for example, a time dimension
Grumbach, S., & Tininini, L. (2000). Automatic aggrega- may be organized into days, months and years.
tion using explicit metadata. In International Conference
on Scientific and Statistical Database Management Drill-Down (Roll-Up): Typical OLAP operation, by
(SSDBM00) (pp. 85-94). which aggregate data are visualized at a finer (coarser)
level of detail along one or more analysis dimensions.
Grumbach, S., & Tininini, L. (2003). On the content of
materialized aggregate views. Journal of Computer and Fact: A single elementary datum in an OLAP system,
System Sciences, 66(1), 133-168. the properties of which correspond to dimensions and
measures.
Gupta, A., Harinarayan, V., & Quass, D. (1995). Aggre-
gate-query processing in data warehousing environments. Fact Table: A table of (integrated) elementary data
In International Conference on Very Large Data Bases grouped and aggregated in the multidimensional query-
(VLDB95) (pp. 358-369). ing process.

Halevy, A.Y. (2001). Answering queries using views. Materialized View: A particular form of query whose
VLDB Journal, 10(4), 270-294. answer is stored in the database to accelerate the evalu-
ation of further queries.
Jagadish, H.V., Lakshmanan, L.V.S., & Srivastava, D.
(1999). What can hierarchies do for data warehouses? In Measure: A numeric value obtained by applying an
International Conference on Very Large Data Bases aggregate function (such as count, sum, min, max or
(VLDB99) (pp. 530-541). average) to groups of data in a fact table.

Lenzerini, M. (2002). Data integration: A theoretical per- Query Answering: Process by which the (possibly
spective. In ACM Symposium on Principles of Database approximate) answer to a given query is obtained by
Systems (PODS02) (pp. 233-246). exploiting the stored answers and definitions of a collec-
tion of materialized views.
Levy, A.Y., Mendelzon, A.O., Sagiv, Y., & Srivastava, D.
(1995). Answering queries using views. In ACM Sympo- Query Rewriting: Process by which a source query is
sium on Principles of Database Systems (PODS95) (pp. transformed into an equivalent one referring (almost ex-
95-104). clusively) to a collection of materialized views. In multidi-
mensional databases, query rewriting is fundamental in
Nutt, W., Sagiv, Y., & Shurin, S. (1998). Deciding equiva- achieving acceptable (online) response times.
lences among aggregate queries. In ACM Symposium on
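To make the Query Rewriting and Materialized View entries concrete, the following minimal sketch (plain Python; the table contents and names are invented for illustration) shows the arithmetic behind rewriting a month-level aggregate query against a materialized day-level aggregate: the monthly average is answered by rolling up the stored partial sums and counts instead of rescanning the raw fact table.

    from collections import defaultdict

    # A hypothetical materialized view: partial aggregates at the day
    # level, as would be produced by "SELECT month, day, SUM(sales),
    # COUNT(sales) FROM facts GROUP BY month, day".
    daily_view = [
        # (month, day, sum_sales, count_sales)
        ("2005-01", 1, 120.0, 4),
        ("2005-01", 2,  80.0, 2),
        ("2005-02", 1,  50.0, 1),
    ]

    # Rewriting "SELECT month, AVG(sales) FROM facts GROUP BY month":
    # AVG at the month level = (sum of daily sums) / (sum of daily
    # counts). Averaging the daily averages would be wrong, which is why
    # rewriting algorithms must reason about the aggregate's semantics.
    sums, counts = defaultdict(float), defaultdict(int)
    for month, _day, s, c in daily_view:
        sums[month] += s
        counts[month] += c
    monthly_avg = {m: sums[m] / counts[m] for m in sums}
    print(monthly_avg)   # {'2005-01': 33.33..., '2005-02': 50.0}

The same sum-of-sums and sum-of-counts reasoning underlies the roll-up of measures along a dimension hierarchy (here, from days to months).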
Aggregation for Predictive Modeling with Relational Data
Claudia Perlich
IBM Research, USA

Foster Provost
New York University, USA

INTRODUCTION

Most data mining and modeling techniques have been developed for data represented as a single table, where every row is a feature vector that captures the characteristics of an observation. However, data in most domains are not of this form and consist of multiple tables with several types of entities. Such relational data are ubiquitous, both because of the large number of multi-table relational databases kept by businesses and government organizations, and because of the natural, linked nature of people, organizations, computers, etc. Relational data pose new challenges for modeling and data mining, including the exploration of related entities and the aggregation of information from multi-sets (bags) of related entities.

BACKGROUND

Relational learning differs from traditional feature-vector learning both in the complexity of the data representation and in the complexity of the models. The relational nature of a domain manifests itself in two ways: (1) entities are not limited to a single type, and (2) entities are related to other entities. Relational learning allows the incorporation of knowledge from entities in multiple tables, including relationships between objects of varying cardinality. Thus, in order to succeed, relational learners have to be able to identify related objects and to aggregate information from bags of related objects into a final prediction.

Traditionally, the analysis of relational data has involved the manual construction by a human expert of attributes (e.g., the number of purchases of a customer during the last three months) that together will form a feature vector. Automated analysis of relational data is becoming increasingly important as the number and complexity of databases increases. Early research on automated relational learning was dominated by Inductive Logic Programming (Muggleton, 1992), where the classification model is a set of first-order-logic clauses and the information aggregation is based on existential unification. More recent relational learning approaches include distance-based methods (Kirsten et al., 2001), propositionalization (Kramer et al., 2001; Knobbe et al., 2001; Krogel et al., 2003), and upgrades of propositional learners such as Naive Bayes (Neville et al., 2003), Logistic Regression (Popescul et al., 2002), Decision Trees (Jensen & Neville, 2002) and Bayesian Networks (Koller & Pfeffer, 1998). Similar to manual feature construction, both upgrades and propositionalization use Boolean conditions and common aggregates like min, max, or sum to transform either explicitly (propositionalization) or implicitly (upgrades) the original relational domain into a traditional feature-vector representation.

Recent work by Knobbe et al. (2001) and Wrobel & Krogel (2001) recognizes the essential role of aggregation in all relational modeling and focuses specifically on the effect of aggregation choices and parameters. Wrobel & Krogel (2003) present one of the few empirical comparisons of aggregation in propositionalization approaches (however, with inconclusive results). Perlich & Provost (2003) show that the choice of aggregation operator can have a much stronger impact on the resultant model's generalization performance than the choice of the model induction method (decision trees or logistic regression, in their study).

MAIN THRUST

For illustration, imagine a direct marketing task where the objective is to identify customers who would respond to a special offer. Available are demographic information and all previous purchase transactions, which include PRODUCT, TYPE and PRICE. In order to take advantage of these transactions, information has to be aggregated. The choice of the aggregation operator is crucial, since aggregation invariably involves loss of (potentially discriminative) information.
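As a minimal sketch of the kind of attribute-wise aggregation at stake here, the following turns one customer's bag of transactions into a fixed-length feature vector with common aggregates (plain Python; the transactions and feature names are invented for illustration):

    # One customer's bag of purchase transactions.
    transactions = [
        {"product": "0312291811", "type": "BOOK", "price": 24.95},
        {"product": "B00005JNBQ", "type": "DVD",  "price": 14.99},
        {"product": "0130319953", "type": "BOOK", "price": 89.00},
    ]

    # Each attribute is aggregated independently into one feature vector;
    # as discussed below, this independence is exactly what limits the
    # expressive power of the resulting model.
    prices = [t["price"] for t in transactions]
    features = {
        "n_purchases": len(transactions),    # count
        "total_spent": sum(prices),          # sum
        "max_price":   max(prices),          # max
        "n_books":     sum(t["type"] == "BOOK" for t in transactions),
    }
    print(features)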
Typical aggregation operators like min, max and sum can only be applied to sets of numeric values, not to objects (an exception being count). It is therefore necessary to assume class-conditional independence and aggregate the attributes independently, which limits the expressive power of the model. Perlich & Provost (2003) discuss in detail the implications of various assumptions and aggregation choices on the expressive power of the resulting classification models. For example, customers who buy mostly expensive books cannot be identified if price and type are aggregated separately. In contrast, ILP methods do not assume independence and can express an expensive book (TYPE=BOOK and PRICE>20); however, aggregation through existential unification can only capture whether a customer bought at least one expensive book, not whether he has bought primarily expensive books. Only two systems, POLKA (Knobbe et al., 2001) and REGLAGGS (Wrobel & Krogel, 2001), combine Boolean conditions and numeric aggregates to increase the expressive power of the model.

Another challenge is posed by categorical attributes with many possible values, such as ISBN numbers of books. Categorical attributes are commonly aggregated using mode (the most common value) or the count for all values if the number of different values is small. These approaches would be ineffective for ISBN: it has many possible values, and the mode is not meaningful since customers usually buy only one copy of each book. Many relational domains include categorical attributes of this type. One common class of such domains involves networked data, where most of the information is captured by the relationships between objects, possibly without any further attributes. The identity of an entity (e.g., Bill Gates) in social, scientific, and economic networks may play a much more important role than any of its attributes (e.g., age or gender). Identifiers such as name, ISBN, or SSN are categorical attributes with excessively many possible values that cannot be accounted for by either mode or count.

Perlich and Provost (2003) present a new multi-step aggregation methodology based on class-conditional distributions that shows promising performance on networked data with identifier attributes. As Knobbe et al. (1999) point out, traditional aggregation operators like min, max, and count are based on histograms. A histogram itself is a crude approximation of the underlying distribution. Rather than estimating one distribution for every bag of attributes, as done by traditional aggregation operators, this new aggregation approach estimates in a first step only one distribution for each class, by combining all bags of objects for the same class. The combination of bags of related objects results in much better estimates of the distribution, since it uses many more observations. The number of parameters differs across distributions: for a normal distribution only two parameters are required, mean and variance, whereas distributions of categorical attributes have as many parameters as possible attribute values. In a second step, the bags of attributes of related objects are aggregated through vector distances (e.g., Euclidean, Cosine, Likelihood) between a normalized vector-representation of the bag and the two class-conditional distributions.

Imagine the following example of a document classification domain with two tables (Document and Author) shown in Figure 1.

Figure 1. Example domain with two tables that are linked through Paper ID

    Document Table          Author Table
    Paper ID   Class        Paper ID   Author Name
    P1         0            P1         A
    P2         1            P2         B
    P3         1            P2         A
    P4         0            P3         B
                            P3         A
                            P3         C
                            P4         C

The first aggregation step estimates the class-conditional distributions DClass n of authors from the Author table. Under the alphabetical ordering of position:value pairs, 1:A, 2:B, and 3:C, the value for DClass n at position k is defined as:

    DClass n[k] = (number of occurrences of author k in the set of authors
                   related to documents of class n)
                  / (number of authors related to documents of class n)

The resulting estimates of the class-conditional distributions for our example are given by:

    DClass 0 = [0.5 0 0.5] and DClass 1 = [0.4 0.4 0.2]

The second aggregation step is the representation of every document as a vector:

    DPn[k] = (number of occurrences of author k related to the document Pn)
             / (number of authors related to document Pn)

The vector-representations for the above examples are DP1 = [1 0 0], DP2 = [0.5 0.5 0], DP3 = [0.33 0.33 0.33], and DP4 = [0 0 1].

The third aggregation step calculates vector distances (e.g., cosine) between the class-conditional distributions and the documents DP1, ..., DP4. The new Document table with the additional cosine features is shown in Figure 2. In this simple example, the distance from DClass 1 separates the examples perfectly; the distance from DClass 0 does not.
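The three steps can be written down compactly. The sketch below (plain Python, using the toy data from Figure 1) computes the class-conditional distributions, the per-document vectors, and the cosine features; it is an illustration of the published method, not the authors' actual implementation, and the cosine values it prints may differ from Figure 2 in rounding or convention.

    from math import sqrt

    # Toy data from Figure 1.
    doc_class = {"P1": 0, "P2": 1, "P3": 1, "P4": 0}
    authors   = [("P1", "A"), ("P2", "B"), ("P2", "A"),
                 ("P3", "B"), ("P3", "A"), ("P3", "C"), ("P4", "C")]
    values = sorted({a for _, a in authors})        # ['A', 'B', 'C']

    def normalize(counts):
        total = sum(counts)
        return [c / total for c in counts]

    # Step 1: one distribution per class, pooling all related authors.
    def class_distribution(cls):
        bag = [a for p, a in authors if doc_class[p] == cls]
        return normalize([bag.count(v) for v in values])

    d0 = class_distribution(0)      # [0.5, 0.0, 0.5]
    d1 = class_distribution(1)      # [0.4, 0.4, 0.2]

    # Step 2: one normalized vector per document.
    def doc_vector(paper):
        bag = [a for p, a in authors if p == paper]
        return normalize([bag.count(v) for v in values])

    # Step 3: cosine between each document vector and each class
    # distribution; these become the new features of the Document table.
    def cosine(x, y):
        dot = sum(a * b for a, b in zip(x, y))
        return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

    for p in sorted(doc_class):
        v = doc_vector(p)
        print(p, round(cosine(v, d1), 3), round(cosine(v, d0), 3))

    # Largest gap between the class distributions picks out author B,
    # anticipating the count-based variant discussed next.
    best = max(range(len(values)), key=lambda k: abs(d0[k] - d1[k]))
    print("most discriminative author:", values[best])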
Figure 2. Extended document table with new cosine features added

    Document Table
    Paper ID   Class   Cosine(Pn, DClass 1)   Cosine(Pn, DClass 0)
    P1         0       0.667                  0.707
    P2         1       0.707                  0.5
    P3         1       0.962                  0.816
    P4         0       0.333                  0.707

By taking advantage of DClass 1 and DClass 0, another new aggregation approach becomes possible. Rather than constructing counts for all distinct values (impossible for high-dimensional categorical attributes), one can select a small subset of values where the absolute difference between the entries in DClass 0 and DClass 1 is maximal. This method would identify author B as the most discriminative.

These new features, constructed from class-conditional distributions, show superior classification performance on a variety of relational domains (Perlich & Provost, 2003, 2004). Table 1 summarizes the relative out-of-sample performances (averaged over 10 experiments, with standard deviations in parentheses) as presented in Perlich (2003) on the CORA document classification task (McCallum et al., 2000) for 400 training examples. The data set includes information about the authorship, citations, and the full text. This example also demonstrates the opportunities arising from the ability of relational models to take advantage of additional background information such as citations and authorship over simple text classification. The comparison includes, in addition to two distribution-based feature construction approaches (1 and 2) using logistic regression for model induction: 3) a Naive Bayes classifier using the full text, learned by the Rainbow (McCallum, 1996) system; 4) a Probabilistic Relational Model (Koller & Pfeffer, 1998) using traditional aggregates on both text and citation/authorship, with the results reported by Taskar et al. (2001); and 5) a Simple Relational Classifier (Macskassy & Provost, 2003) that uses only the known class labels of related (e.g., cited) documents. It is important to observe that traditional aggregation operators such as mode for high-dimensional categorical fields (author names and document identifiers) are not applicable.

Table 1. Comparative classification performance

    Method (Used Information)                                                                    Accuracy
    1) Class-Conditional Distributions (Authorship & Citations)                                  0.78 (0.01)
    2) Class-Conditional Distributions and Most Discriminative Counts (Authorship & Citations)   0.81 (0.01)
    3) Naive Bayes Classifier using Rainbow (Text)                                               0.74 (0.03)
    4) Probabilistic Relational Model (Text, Authorship & Citations)                             0.74 (0.01)
    5) Simple Relational Model (Related Class Labels)                                            0.68 (0.01)

The generalization performance of the new aggregation approach is related to a number of properties that are of particular relevance and advantage for predictive modeling:

Dimensionality Reduction: The use of distances compresses the high-dimensional space of possible categorical values into a small set of dimensions, one for each class and distance metric. In particular, this allows the aggregation of object identifiers.

Preservation of Discriminative Information: Changing the class labels of the target objects will change the values of the aggregates. The loss of discriminative information is lower since the class-conditional distributions capture significant differences.

Domain Independence: The density estimation does not require any prior knowledge about the application domain and therefore is suitable for a variety of domains.

Applicability to Numeric Attributes: The approach is not limited to categorical values but can also be applied to numeric attributes after discretization. Note that traditional aggregation through mean and variance implicitly assumes a normal distribution, whereas this aggregation makes no prior distributional assumptions and can capture arbitrary numeric distributions.

Monotonic Relationship: The use of distances to class-conditional densities constructs numerical features that are monotonic in the probability of class membership. This makes logistic regression a natural choice for the model induction step.
Aggregation of Identifiers: By using object identifiers such as names, it can overcome some of the limitations of the independence assumptions and even allow learning from unobserved object properties (Perlich & Provost, 2004). The identifier represents the full information of the object, in particular the joint distribution of all other attributes and even further unknown properties.

Task-Specific Feature Construction: The advantages outlined above are possible through the use of the target value during feature construction. This practice requires splitting the training set into two separate portions for 1) the class-conditional density estimation and feature construction, and 2) the estimation of the classification model.

To summarize, most relational modeling has limited itself to a small set of existing aggregation operators. The recognition of the limited expressive power motivated the combination of Boolean conditioning and aggregation, and the development of new aggregation methodologies that are specifically designed for predictive relational modeling.

FUTURE TRENDS

Computer-based analysis of relational data is becoming increasingly necessary as the size and complexity of databases grow. Many important tasks, including counter-terrorism (Tang et al., 2003), social and economic network analysis (Jensen & Neville, 2002), document classification (Perlich, 2003), customer relationship management, personalization, fraud detection (Fawcett & Provost, 1997), and genetics [e.g., see the overview by Džeroski (2001)], used to be approached with special-purpose algorithms, but now are recognized as inherently relational. These application domains both profit from and contribute to research in relational modeling in general and aggregation for feature construction in particular.

In order to accommodate such a variety of domains, new aggregators must be developed. In particular, it is necessary to account for domain-specific dependencies between attributes and entities that currently are ignored. One common type of such dependency is the temporal order of events, which is important for the discovery of causal relationships.

Aggregation as a research topic poses the opportunity for significant theoretical contributions. There is little theoretical work on relational model estimation outside of first-order logic. In contrast to a large body of work in mathematics on the estimation of functional dependencies that map well-defined input spaces to output spaces, aggregation operators have not been investigated nearly as thoroughly. Model estimation tasks are usually framed as search over a structured (either in terms of parameters or increasing complexity) space of possible solutions. But the structuring of a search space of aggregation operators remains an open question.

The potential complexity of relational models and the resulting computational complexity of relational modeling remains an obstacle to real-time applications. This limitation has spawned work in efficiency improvements (Yin et al., 2003; Tang et al., 2003) and will remain an important task.

CONCLUSION

Relational modeling is a burgeoning topic within machine learning research, and is commonly applicable in real-world domains. Many domains collect large amounts of transaction and interaction data, but so far lack a reliable and automated mechanism for model estimation to support decision-making. Relational modeling with appropriate aggregation methods has the potential to fill this gap and allow the seamless integration of model estimation on top of existing relational databases, relieving the analyst from the manual, time-consuming, and omission-prone task of feature construction.

REFERENCES

Džeroski, S. (2001). Relational data mining applications: An overview. In S. Džeroski & N. Lavrač (Eds.), Relational data mining (pp. 339-364). Berlin: Springer Verlag.

Fawcett, T., & Provost, F. (1997). Adaptive fraud detection. Data Mining and Knowledge Discovery, 1.

Jensen, D., & Neville, J. (2002). Data mining in social networks. In R. Breiger, K. Carley, & P. Pattison (Eds.), Dynamic social networks modeling and analysis (pp. 287-302). The National Academies Press.

Kirsten, M., Wrobel, S., & Horváth, T. (2001). Distance based approaches to relational learning and clustering. In S. Džeroski & N. Lavrač (Eds.), Relational data mining (pp. 213-234). Berlin: Springer Verlag.

Knobbe, A.J., de Haas, M., & Siebes, A. (2001). Propositionalisation and aggregates. In L. De Raedt & A. Siebes (Eds.), Proceedings of the Fifth European Conference on Principles of Data Mining and Knowledge Discovery (LNAI 2168) (pp. 277-288). Berlin: Springer Verlag.

Koller, D., & Pfeffer, A. (1998). Probabilistic frame-based systems. In Proceedings of the Fifteenth/Tenth Conference on Artificial Intelligence/Innovative Applications of Artificial Intelligence (pp. 580-587). American Association for Artificial Intelligence.
Kramer, S., Lavrač, N., & Flach, P. (2001). Propositionalization approaches to relational data mining. In S. Džeroski & N. Lavrač (Eds.), Relational data mining (pp. 262-291). Berlin: Springer Verlag.

Krogel, M.A., Rawles, S., Železný, F., Flach, P.A., Lavrač, N., & Wrobel, S. (2003). Comparative evaluation of approaches to propositionalization. In T. Horváth & A. Yamamoto (Eds.), Proceedings of the 13th International Conference on Inductive Logic Programming (LNAI 2835) (pp. 197-214). Berlin: Springer-Verlag.

Krogel, M.A., & Wrobel, S. (2001). Transformation-based learning using multirelational aggregation. In C. Rouveirol & M. Sebag (Eds.), Proceedings of the Eleventh International Conference on Inductive Logic Programming (ILP) (LNAI 2157) (pp. 142-155). Berlin: Springer Verlag.

Krogel, M.A., & Wrobel, S. (2003). Facets of aggregation approaches to propositionalization. In T. Horváth & A. Yamamoto (Eds.), Proceedings of the Work-in-Progress Track at the 13th International Conference on Inductive Logic Programming (pp. 30-39).

Macskassy, S.A., & Provost, F. (2003). A simple relational classifier. In Proceedings of the Workshop on Multi-Relational Data Mining at SIGKDD-2003.

McCallum, A.K. (1996). Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. Retrieved from http://www.cs.cmu.edu/~mccallum/bow

McCallum, A.K., Nigam, K., Rennie, J., & Seymore, K. (2000). Automating the construction of Internet portals with machine learning. Information Retrieval, 3(2), 127-163.

Muggleton, S. (Ed.). (1992). Inductive logic programming. London: Academic Press.

Neville, J., Jensen, D., & Gallagher, B. (2003). Simple estimators for relational Bayesian classifiers. In Proceedings of the Third IEEE International Conference on Data Mining (pp. 609-612).

Perlich, C. (2003). Citation-based document classification. In Proceedings of the Workshop on Information Technology and Systems (WITS).

Perlich, C., & Provost, F. (2003). Aggregation-based feature invention and relational concept classes. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Perlich, C., & Provost, F. (2004). ACORA: Distribution-based aggregation for relational learning from identifier attributes. Working Paper CeDER-04-04, Stern School of Business.

Popescul, A., Ungar, L.H., Lawrence, S., & Pennock, D.M. (2002). Structural logistic regression: Combining relational and statistical learning. In Proceedings of the Workshop on Multi-Relational Data Mining.

Tang, L.R., Mooney, R.J., & Melville, P. (2003). Scaling up ILP to large examples: Results on link discovery for counter-terrorism. In Proceedings of the Workshop on Multi-Relational Data Mining (pp. 107-121).

Taskar, B., Segal, E., & Koller, D. (2001). Probabilistic classification and clustering in relational data. In Proceedings of the 17th International Joint Conference on Artificial Intelligence (pp. 870-878).

Yin, X., Han, J., & Yang, J. (2003). Efficient multi-relational classification by tuple ID propagation. In Proceedings of the Workshop on Multi-Relational Data Mining.

KEY TERMS

Aggregation: Also commonly called a summary, an aggregation is the calculation of a value from a bag or (multi)set of entities. Typical aggregations are sum, count, and average.

Discretization: Conversion of a numeric variable into a categorical variable, usually through binning. The entire range of the numeric values is split into a number of bins. The numeric value of the attribute is replaced by the identifier of the bin into which it falls.

Class-Conditional Independence: Property of a multivariate distribution with a categorical class variable c and a set of other variables (e.g., x and y). The probability of observing a combination of variable values given the class label is equal to the product of the probabilities of each variable value given the class: P(x,y|c) = P(x|c)*P(y|c).

Inductive Logic Programming: A field of research at the intersection of logic programming and inductive machine learning, drawing ideas and methods from both disciplines. The objective of ILP methods is the inductive construction of first-order Horn clauses from a set of examples and background knowledge in relational form.

Propositionalization: The process of transforming a multi-relational dataset, containing structured examples, into a propositional data set (one table) with derived attribute-value features, describing the structural properties of the example.
Relational Data: Data where the original information cannot be represented in a single table but requires two or more tables in a relational database. Every table can either capture the characteristics of entities of a particular type (e.g., person or product) or relationships between entities (e.g., person bought product).

Relational Learning: Learning in relational domains that include information from multiple tables, not based on manual feature construction.

Target Objects: Objects in a particular target table for which a prediction is to be made. Other objects reside in additional background tables, but are not the focus of the prediction task.
API Standardization Efforts for Data Mining
Jaroslav Zendulka
Brno University of Technology, Czech Republic

INTRODUCTION

Data mining technology has just recently become usable in real-world scenarios. At present, the data mining models generated by commercial data mining and statistical applications are often used as components in other systems in such fields as customer relationship management, risk management, or processing scientific data. Therefore, it seems natural that most data mining products concentrate on data mining technology rather than on ease of use, scalability, or portability. It is evident that employing common standards greatly simplifies the integration, updating, and maintenance of applications and systems containing components provided by other producers (Grossman, Hornick, & Meyer, 2002). Data mining models generated by data mining algorithms are good examples of such components.

Currently, established and emerging standards address especially the following aspects of data mining:

Metadata: for representing data mining metadata that specify a data mining model and results of model operations (CWM, 2001).

Application Programming Interfaces (APIs): for employing data mining components in applications.

Process: for capturing the whole knowledge discovery process (CRISP-DM, 2000).

In this paper, we focus on standard APIs. The objective of these standards is to facilitate integration of data mining technology with application software. Probably the best-known initiatives in this field are OLE DB for Data Mining (OLE DB for DM), SQL/MM Data Mining (SQL/MM DM), and Java Data Mining (JDM).

Another standard, which is not an API but is important for integration and interoperability of data mining products and applications, is the Predictive Model Markup Language (PMML). It is a standard format for data mining model exchange developed by the Data Mining Group (DMG) (PMML, 2003). It is supported by all the standard APIs presented in this paper.

BACKGROUND

The goal of data mining API standards is to make it possible for different data mining algorithms from various software vendors to be easily plugged into applications. A software package that provides data mining services is called a data mining provider, and an application that employs these services is called a data mining consumer. The data mining provider itself includes three basic architectural components (Hornick et al., 2002):

API, the End User Visible Component: An application developer using a data mining provider has to know only its API.

Data Mining Engine (or Server): the core component of a data mining provider. It provides an infrastructure that offers a set of data mining services to data mining consumers.

Metadata Repository: a repository that serves to store data mining metadata.

The standard APIs presented in this paper are not designed to support the entire knowledge discovery process but the data mining step only (Han & Kamber, 2001). They do not provide all necessary facilities for data cleaning, transformations, aggregations, and other data preparation operations. It is assumed that data preparation is done before an appropriate data mining algorithm offered by the API is applied.

There are four key concepts that are supported by the APIs: a data mining model, data mining task, data mining technique, and data mining algorithm. The data mining model is a representation of a given set of data. It is the result of one of the data mining tasks, during which a data mining algorithm for a given data mining technique builds the model. For example, a decision tree, as one of the classification models, is the result of a run of a decision tree-based algorithm.

The basic data mining tasks that the standard APIs support enable users to:

1. Build a data mining model. This task consists of two steps. First the data model is defined, that is, the source data that will be mined is specified, the source data structure (referred to as physical schema) is mapped on inputs of a data mining algorithm (referred to as logical schema), and the algorithm used to build the data mining model is specified. Then, the data mining model is built from training data.
2. Test the quality of a mining model by applying testing data.
3. Apply a data mining model to new data.
4. Browse a data mining model for reporting and visualization applications.

The APIs support several commonly accepted and widely used techniques both for predictive and descriptive data mining (see Table 1). Not all techniques need all the tasks listed above. For example, association rule mining does not require testing and application to new data, whereas classification does.

Table 1. Supported data mining techniques

    Technique                          OLE DB for DM   SQL/MM DM   JDM
    Association rules                  X               X           X
    Clustering (segmentation)          X               X           X
    Classification                     X               X           X
    Sequence and deviation analysis    X
    Density estimation                 X
    Regression                                         X
    Approximation                                                  X
    Attribute importance                                           X

The goals of the APIs are very similar, but the approach of each of them is different. OLE DB for DM is a language-based interface, SQL/MM DM is based on user-defined data types in SQL:1999, and JDM contains packages of data mining oriented Java interfaces and classes. In the next section, each of the APIs is briefly characterized. An example showing their application in prediction is presented in another article in this encyclopedia.

MAIN THRUST

OLE DB for Data Mining

OLE DB for DM (OLE DB, 2000) is Microsoft's API that aims to become the industry standard. It provides a set of extensions to OLE DB, which is Microsoft's object-oriented specification for a set of data access interfaces designed for record-oriented data stores. It employs SQL commands as arguments of interface operations. The approach in defining OLE DB for DM was not to extend OLE DB interfaces but to expose data mining interfaces in a language-based API.

OLE DB for DM treats a data mining model as if it were a special type of table: (a) input data in the form of a set of cases is associated with a data mining model, together with additional meta-information, while defining the data mining model; (b) when input data is inserted into the data mining model (it is "populated"), a mining algorithm builds an abstraction of the data and stores it into this special table. For example, if the data model represents a decision tree, the table contains a row for each leaf node of the tree (Netz et al., 2001). Once the data mining model is populated, it can be used for prediction, or it can be browsed for visualization.

OLE DB for DM extends the syntax of several SQL statements for defining, populating, and using a data mining model; see Figure 1.

Figure 1. Extended SQL statements in OLE DB for DM (diagram: 1, a data mining model is defined with a CREATE statement specifying source data columns, model columns, the mining algorithm and algorithm settings; 2, the model is populated with an INSERT statement; 3, the model is tested, applied, and browsed with SELECT statements)
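As a hedged sketch of what these extended statements look like in practice, the code below follows the general shape of the OLE DB for DM specification; the model, column, table, and data-source names are hypothetical, and the connection is assumed to come from an ADO-style Python binding such as the adodbapi package. Exact syntax and accepted data-source constructs (OPENQUERY, OPENROWSET, SHAPE) vary by provider, so treat this as an outline rather than a definitive program.

    import adodbapi  # assumed ADO binding; any OLE DB-capable connector would do

    # Hypothetical connection string for a data mining provider.
    conn = adodbapi.connect("Provider=MSOLAP; Data Source=localhost; "
                            "Initial Catalog=MiningDB")
    cur = conn.cursor()

    # 1. Define the mining model (CREATE): model columns plus the algorithm.
    cur.execute("""
        CREATE MINING MODEL CreditRisk (
            CustomerId  LONG KEY,
            Age         LONG CONTINUOUS,
            Income      LONG CONTINUOUS,
            Risk        TEXT DISCRETE PREDICT
        ) USING Microsoft_Decision_Trees
    """)

    # 2. Populate the model (INSERT): training cases read from a table.
    cur.execute("""
        INSERT INTO CreditRisk (CustomerId, Age, Income, Risk)
        OPENQUERY(TrainingDataSource,
                  'SELECT CustomerId, Age, Income, Risk FROM Customers')
    """)

    # 3. Use the model (SELECT with PREDICTION JOIN) to score new cases.
    cur.execute("""
        SELECT t.CustomerId, CreditRisk.Risk
        FROM CreditRisk PREDICTION JOIN
             OPENQUERY(NewDataSource,
                       'SELECT CustomerId, Age, Income FROM Applicants') AS t
        ON CreditRisk.Age = t.Age AND CreditRisk.Income = t.Income
    """)
    for row in cur.fetchall():
        print(row)

The PREDICTION JOIN in step 3 is the OLE DB for DM idiom for applying a populated model to new data, corresponding to task 3 in the list above.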
SQL/MM Data Mining

SQL/MM DM is an international ISO/IEC standard (SQL, 2002), which is part of the SQL Multimedia and Application Packages (SQL/MM) (Melton & Eisenberg, 2001). It is based on SQL:1999 and its structured user-defined data types (UDT). The structured UDT is the fundamental facility in SQL:1999 that supports object orientation (Melton & Simon, 2001). The idea of SQL/MM DM is to provide UDTs and associated methods for defining input data, a data mining model, a data mining task and its settings, and for results of testing or applying the data mining model. Training, testing, and application data must be stored in a table. Relations of the UDTs are shown in Figure 2.

Some of the UDTs are related to mining techniques. Their names contain "XX" in the figure, which should be "Clas", "Rule", "Clus", and "Reg" for classification, association rules, clustering and regression, respectively.

Figure 2. Relations of data mining UDTs introduced in SQL/MM DM (diagram relating the UDTs DM_MiningData (training and testing data), DM_LogicalDataSpec, DM_XXSettings, DM_XXBldTask, DM_XXModel, DM_XXTestResult, DM_ApplicationData (application data), and DM_XXResult)

Java Data Mining

Java Data Mining (JDM), known as Java Specification Request 73 (JSR-73) (Hornick et al., 2004), is a Java standard being developed under Sun's Java Community Process. The standard is based on a generalized, object-oriented, data mining conceptual model. JDM supports common data mining operations, as well as the creation, persistence, access, and maintenance of metadata supporting mining activities.

Compared with OLE DB for DM and SQL/MM DM, JDM is more complex because it does not rely on any other built-in support, such as OLE DB or SQL. It is a pure Java API that specifies a set of Java interfaces and classes, which must be implemented in a data mining provider. Some of the JDM concepts are close to those in SQL/MM DM, but the number of Java interfaces and classes in JDM is higher than the number of UDTs in SQL/MM DM. JDM specifies interfaces for objects which provide an abstraction of the metadata needed to execute data mining tasks. Once a task is executed, another object that represents the result of the task is created.

FUTURE TRENDS

OLE DB for DM is a Microsoft standard which aims to be an industry standard. A reference implementation of a data mining provider based on this standard is available in Microsoft SQL Server 2000 (Netz et al., 2001). SQL/MM DM was adopted as an international ISO/IEC standard. As it is based on the user-defined data type feature of SQL:1999, support of UDTs in database management systems is essential for implementations of data mining providers based on this standard.

JDM must still go through several steps before being accepted as an official Java standard. At the time of writing this paper, it was in the stage of final draft public review. The Oracle9i Data Mining (Oracle9i, 2002) API provides an early look at concepts and approaches proposed for JDM. It is assumed to comply with the JDM standard when the standard is published.

All the standards support PMML as a format for data mining model exchange. They enable a data mining model to be imported and exported in this format. In OLE DB for DM, a new model can be created from a PMML document. SQL/MM DM provides methods of the DM_XXModel UDT to import and export a PMML document. Similarly, JDM specifies interfaces for import and export tasks.
CONCLUSION

Three standard APIs for data mining were presented in this paper. However, their implementation is not available yet or is only a reference one. Schwenkreis (2001) commented on this situation: the fact that implementations come after standards is a general trend in today's standardization efforts. It seems that in the case of data mining, standards are not only intended to unify existing products with well-known functionality, but to (partially) design the functionality such that future products match real-world requirements. A simple example of using the APIs in prediction is presented in another article of this book (Zendulka, 2005).

REFERENCES

Common Warehouse Metamodel Specification: Data Mining. Version 1.0. (2001). Retrieved from http://www.omg.org/docs/ad/01-02-01.pdf

Cross Industry Standard Process for Data Mining (CRISP-DM). Version 1.0. (2000). Retrieved from http://www.crisp-dm.org/

Grossman, R.L., Hornick, M.F., & Meyer, G. (2002). Data mining standards initiatives. Communications of the ACM, 45(8), 59-61.

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. Morgan Kaufmann Publishers.

Hornick, M. et al. (2004). Java Specification Request 73: Java Data Mining (JDM). Version 0.96. Retrieved from http://jcp.org/aboutJava/communityprocess/first/jsr073/

Melton, J., & Eisenberg, A. (2001). SQL Multimedia and Application Packages (SQL/MM). SIGMOD Record, 30(4), 97-102.

Melton, J., & Simon, A. (2001). SQL:1999. Understanding relational language components. Morgan Kaufmann Publishers.

Microsoft Corporation. (2000). OLE DB for Data Mining Specification Version 1.0.

Netz, A. et al. (2001, April). Integrating data mining with SQL databases: OLE DB for data mining. In Proceedings of the 17th International Conference on Data Engineering (ICDE'01) (pp. 379-387). Heidelberg, Germany.

Oracle9i Data Mining. Concepts. Release 9.2.0.2. (2002). Viewable CD Release 2 (9.2.0.2.0).

PMML Version 2.1. (2003). Retrieved from http://www.dmg.org/pmml-v2-1.html

Saarenvirta, G. (2001, Summer). Operation data mining. DB2 Magazine, 6(2). International Business Machines Corporation. Retrieved from http://www.db2mag.com/db_area/archives/2001/q2/saarenvirta.shtml

SAS Enterprise Miner to support PMML. (September 17, 2002). Retrieved from http://www.sas.com/news/preleases/091702/news1.html

Schwenkreis, F. (2001). Data mining: Technology driven by standards? Retrieved from http://www.research.microsoft.com/~jamesrh/hpts2001/submissions/FriedemannSchwenkreis.htm

SQL Multimedia and Application Packages. Part 6: Data Mining. ISO/IEC 13249-6. (2002).

Zendulka, J. (2005). Using standard APIs for data mining in prediction. In J. Wang (Ed.), Encyclopedia of data warehousing and mining. Hershey, PA: Idea Group Reference.

KEY TERMS

API: An application programming interface (API) is a description of the way one piece of software asks another program to perform a service. A standard API for data mining enables different data mining algorithms from various vendors to be easily plugged into application programs.

Data Mining Model: A high-level global description of a given set of data which is the result of a data mining technique over the set of data. It can be descriptive or predictive.

DMG: The Data Mining Group (DMG) is a consortium of data mining vendors for developing data mining standards. They have developed the Predictive Model Markup Language (PMML).

JDM: Java Data Mining (JDM) is an emerging standard API for the programming language Java. It is an object-oriented interface that specifies a set of Java classes and interfaces supporting data mining operations for building, testing, and applying a data mining model.

OLE DB for DM: OLE DB for Data Mining (OLE DB for DM) is Microsoft's language-based standard API that introduces several SQL-like statements supporting data mining operations for building, testing, and applying a data mining model.

PMML: Predictive Model Markup Language (PMML) is an XML-based language which provides a quick and easy way for applications to produce data mining models in a vendor-independent format and to share them between compliant applications.
SQL:1999: Structured Query Language (SQL):1999. The version of the standard database language SQL adopted in 1999, which introduced object-oriented features.

SQL/MM DM: SQL Multimedia and Application Packages Part 6: Data Mining (SQL/MM DM) is an international standard the purpose of which is to define data mining user-defined types and associated routines for building, testing, and applying data mining models. It is based on structured user-defined types of SQL:1999.
The Application of Data Mining to Recommender Systems
J. Ben Schafer
University of Northern Iowa, USA

INTRODUCTION

In a world where the number of choices can be overwhelming, recommender systems help users find and evaluate items of interest. They connect users with items to consume (purchase, view, listen to, etc.) by associating the content of recommended items or the opinions of other individuals with the consuming user's actions or opinions. Such systems have become powerful tools in domains from electronic commerce to digital libraries and knowledge management. For example, a consumer of just about any major online retailer who expresses an interest in an item, either through viewing a product description or by placing the item in his shopping cart, will likely receive recommendations for additional products. These products can be recommended based on the top overall sellers on a site, on the demographics of the consumer, or on an analysis of the past buying behavior of the consumer as a prediction for future buying behavior. This paper will address the technology used to generate recommendations, focusing on the application of data mining techniques.

BACKGROUND

Many different algorithmic approaches have been applied to the basic problem of making accurate and efficient recommender systems. The earliest recommender systems were content filtering systems designed to fight information overload in textual domains. These were often based on traditional information filtering and information retrieval systems. Recommender systems that incorporate information retrieval methods are frequently used to satisfy ephemeral needs (short-lived, often one-time needs) from relatively static databases, for example, requesting a recommendation for a book preparing a sibling for a new child in the family. Conversely, recommender systems that incorporate information-filtering methods are frequently used to satisfy persistent information needs (long-lived, often frequent, and specific) from relatively stable databases in domains with a rapid turnover or frequent additions, for example, recommending AP stories to a user concerning the latest news regarding a senator's re-election campaign.

Without computers, a person often receives recommendations by listening to what people around him have to say. If many people in the office state that they enjoyed a particular movie, or if someone he tends to agree with suggests a given book, then he may treat these as recommendations. Collaborative filtering (CF) is an attempt to facilitate this process of "word of mouth." The simplest of CF systems provide generalized recommendations by aggregating the evaluations of the community at large. More personalized systems (Resnick & Varian, 1997) employ techniques such as user-to-user correlations or a nearest-neighbor algorithm.

The application of user-to-user correlations derives from statistics, where correlations between variables are used to measure the usefulness of a model. In recommender systems, correlations are used to measure the extent of agreement between two users (Breese, Heckerman, & Kadie, 1998) and to identify users whose ratings will contain high predictive value for a given user. Care must be taken, however, to identify correlations that are actually helpful. Users who have only one or two rated items in common should not be treated as strongly correlated. Herlocker et al. (1999) improved system accuracy by applying a significance weight to the correlation based on the number of co-rated items.

Nearest-neighbor algorithms compute the distance between users based on their preference history. Distances vary greatly based on domain, number of users, number of recommended items, and degree of co-rating between users. Predictions of how much a user will like an item are computed by taking the weighted average of the opinions of a set of neighbors for that item. As applied in recommender systems, neighbors are often generated online on a query-by-query basis rather than through the off-line construction of a more thorough model. As such, they have the advantage of being able to rapidly incorporate the most up-to-date information, but the search for neighbors is slow in large databases. Practical algorithms use heuristics to search for good neighbors and may use opportunistic sampling when faced with large populations.
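These two paragraphs translate into only a few lines of code. The sketch below (plain Python; the ratings, user names, and the overlap threshold of 50 are invented for illustration) computes Pearson correlations over co-rated items, devalues correlations built on few co-rated items in the spirit of Herlocker et al.'s significance weighting, and predicts a rating as a weighted average of the neighbors' deviations from their own means:

    from math import sqrt

    ratings = {  # toy data: user -> {item: rating on a 1-5 scale}
        "ann":  {"m1": 5, "m2": 3, "m3": 4},
        "bob":  {"m1": 4, "m2": 2, "m3": 5, "m4": 2},
        "cate": {"m1": 2, "m2": 5, "m4": 4},
    }

    def mean(u):
        vals = ratings[u].values()
        return sum(vals) / len(vals)

    def pearson(u, v):
        """Correlation over co-rated items, plus the co-rating count."""
        common = sorted(set(ratings[u]) & set(ratings[v]))
        if len(common) < 2:
            return 0.0, len(common)
        du = [ratings[u][i] - mean(u) for i in common]
        dv = [ratings[v][i] - mean(v) for i in common]
        den = sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
        r = sum(a * b for a, b in zip(du, dv)) / den if den else 0.0
        return r, len(common)

    def predict(user, item, overlap=50):
        """Weighted average of neighbors' deviations from their means."""
        num = den = 0.0
        for other in ratings:
            if other == user or item not in ratings[other]:
                continue
            w, n = pearson(user, other)
            w *= min(n, overlap) / overlap   # significance weighting
            num += w * (ratings[other][item] - mean(other))
            den += abs(w)
        return mean(user) + num / den if den else None

    print(round(predict("ann", "m4"), 2))

With the toy data, the strongly anti-correlated user pulls Ann's prediction for m4 below her own mean, exactly the effect the weighting machinery is meant to keep in proportion.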
Both nearest-neighbor and correlation-based recommenders provide a high level of personalization in their recommendations, and most early systems using these techniques showed promising accuracy rates. As such, CF-based systems have continued to be popular in recommender applications and have provided the benchmarks against which more recent applications have been compared.

DATA MINING IN RECOMMENDER APPLICATIONS

The term data mining refers to a broad spectrum of mathematical modeling techniques and software tools that are used to find patterns in data and use these to build models. In this context of recommender applications, the term data mining is used to describe the collection of analysis techniques used to infer recommendation rules or build recommendation models from large data sets. Recommender systems that incorporate data mining techniques make their recommendations using knowledge learned from the actions and attributes of users. These systems are often based on the development of user profiles that can be persistent (based on demographic or item consumption history data), ephemeral (based on the actions during the current session), or both. These algorithms include clustering, classification techniques, the generation of association rules, and the production of similarity graphs through techniques such as Horting.

Clustering techniques work by identifying groups of consumers who appear to have similar preferences. Once the clusters are created, averaging the opinions of the other consumers in her cluster can be used to make predictions for an individual. Some clustering techniques represent each user with partial participation in several clusters. The prediction is then an average across the clusters, weighted by degree of participation. Clustering techniques usually produce less-personal recommendations than other methods, and in some cases, the clusters have worse accuracy than CF-based algorithms (Breese, Heckerman, & Kadie, 1998). Once the clustering is complete, however, performance can be very good, since the size of the group that must be analyzed is much smaller. Clustering techniques can also be applied as a first step for shrinking the candidate set in a CF-based algorithm or for distributing neighbor computations across several recommender engines. While dividing the population into clusters may hurt the accuracy of recommendations to users near the fringes of their assigned cluster, pre-clustering may be a worthwhile trade-off between accuracy and throughput.

Classifiers are general computational models for assigning a category to an input. The inputs may be vectors of features for the items being classified or data about relationships among the items. The category is a domain-specific classification such as malignant/benign for tumor classification, approve/reject for credit requests, or intruder/authorized for security checks. One way to build a recommender system using a classifier is to use information about a product and a customer as the input, and to have the output category represent how strongly to recommend the product to the customer. Classifiers may be implemented using many different machine-learning strategies including rule induction, neural networks, and Bayesian networks. In each case, the classifier is trained using a training set in which ground truth classifications are available. It can then be applied to classify new items for which the ground truths are not available. If subsequent ground truths become available, the classifier may be retrained over time.

For example, Bayesian networks create a model based on a training set with a decision tree at each node and edges representing user information. The model can be built off-line over a matter of hours or days. The resulting model is very small, very fast, and essentially as accurate as CF methods (Breese, Heckerman, & Kadie, 1998). Bayesian networks may prove practical for environments in which knowledge of consumer preferences changes slowly with respect to the time needed to build the model, but are not suitable for environments in which consumer preference models must be updated rapidly or frequently.

Classifiers have been quite successful in a variety of domains, ranging from the identification of fraud and credit risks in financial transactions to medical diagnosis to intrusion detection. Good et al. (1999) implemented induction-learned feature-vector classification of movies and compared the classification with CF recommendations; this study found that the classifiers did not perform as well as CF, but that combining the two added value over CF alone.

One of the best-known examples of data mining in recommender systems is the discovery of association rules, or item-to-item correlations (Sarwar et al., 2001). These techniques identify items frequently found in "association" with items in which a user has expressed interest. Association may be based on co-purchase data, preference by common users, or other measures. In its simplest implementation, item-to-item correlation can be used to identify "matching" items for a single item, such as other clothing items that are commonly purchased with a pair of pants. More powerful systems match an entire set of items, such as those in a customer's shopping cart, to identify appropriate items to recommend. These rules can also help a merchandiser arrange products so that, for example, a consumer purchasing a child's handheld video game sees batteries nearby. More sophisticated temporal data mining may suggest that a consumer who buys the video game today is likely to buy a pair of earplugs in the next month.
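A toy version of the co-purchase case makes the support and confidence machinery concrete. The baskets and thresholds below are invented for illustration; production systems use algorithms such as Apriori over far larger data and longer itemsets.

    from itertools import combinations

    baskets = [
        {"pants", "belt"}, {"pants", "belt", "shirt"},
        {"pants", "shirt"}, {"video game", "batteries"},
        {"video game", "batteries", "earplugs"},
    ]
    n = len(baskets)

    def support(itemset):
        """Fraction of baskets that contain every item in the itemset."""
        return sum(itemset <= b for b in baskets) / n

    # One-antecedent rules "a -> b" passing minimum support and confidence.
    items = set().union(*baskets)
    for a, b in combinations(sorted(items), 2):
        for lhs, rhs in ((a, b), (b, a)):
            s = support({lhs, rhs})
            if s >= 0.2 and support({lhs}) > 0:
                conf = s / support({lhs})
                if conf >= 0.6:
                    print(f"{lhs} -> {rhs}  support={s:.2f}  confidence={conf:.2f}")

Among the rules this prints are pants -> belt and video game -> batteries, the kind of item-to-item associations a merchandiser or recommender would act on.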
Item-to-item correlation recommender applications usually use current interest rather than long-term customer history, which makes them particularly well suited for ephemeral needs such as recommending gifts or locating documents on a topic of short-lived interest. A user merely needs to identify one or more "starter" items to elicit recommendations tailored to the present rather than the past.

Association rules have been used for many years in merchandising, both to analyze patterns of preference across products and to recommend products to consumers based on other products they have selected. An association rule expresses the relationship that one product is often purchased along with other products. The number of possible association rules grows exponentially with the number of products in a rule, but constraints on confidence and support, combined with algorithms that build association rules with itemsets of n items from rules with n-1 item itemsets, reduce the effective search space. Association rules can form a very compact representation of preference data that may improve efficiency of storage as well as performance. They are more commonly used for larger populations rather than for individual consumers, and they, like other learning methods that first build and then apply models, are less suitable for applications where knowledge of preferences changes rapidly. Association rules have been particularly successful in broad applications such as shelf layout in retail stores. By contrast, recommender systems based on CF techniques are easier to implement for personal recommendation in a domain where consumer opinions are frequently added, such as online retail.

In addition to use in commerce, association rules have become powerful tools in recommendation applications in the domain of knowledge management. Such systems attempt to predict which Web page or document can be most useful to a user. As Géry (2003) writes, "The problem of finding Web pages visited together is similar to finding associations among itemsets in transaction databases. Once transactions have been identified, each of them could represent a basket, and each web resource an item." Systems built on this approach have been demonstrated to produce both high accuracy and precision in the coverage of documents recommended (Geyer-Schultz et al., 2002).

Horting is a graph-based technique in which nodes are users, and edges between nodes indicate degree of similarity between two users (Wolf et al., 1999). Predictions are produced by walking the graph to nearby nodes and combining the opinions of the nearby users. Horting differs from collaborative filtering as the graph may be walked through other consumers who have not rated the product in question, thus exploring transitive relationships that traditional CF algorithms do not consider. In one study using synthetic data, Horting produced better predictions than a CF-based algorithm (Wolf et al., 1999).
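The following sketch shows one simple way such user-controlled merging can work: each source returns scored candidates, and a per-user weight vector blends them into a single ranked list. The sources, weights, and function names are illustrative assumptions, not the actual MetaLens or SmartPad implementations.

```python
def merge_recommendations(source_scores, user_weights):
    """Blend several recommenders' scores into one ranked list.

    source_scores: {source_name: {item: score in [0, 1]}}
    user_weights:  {source_name: importance chosen by the user}
    """
    combined = {}
    for source, scores in source_scores.items():
        w = user_weights.get(source, 0.0)
        for item, score in scores.items():
            combined[item] = combined.get(item, 0.0) + w * score
    return sorted(combined, key=combined.get, reverse=True)

# A user who cares mostly about content constraints (e.g., no offensive scenes).
ranked = merge_recommendations(
    {"collaborative": {"movie_a": 0.9, "movie_b": 0.4},
     "content_filter": {"movie_a": 0.2, "movie_b": 1.0}},
    {"collaborative": 0.3, "content_filter": 0.7},
)
print(ranked)  # ['movie_b', 'movie_a']
```

Raising the weight of one source is the sketch's analogue of a MetaLens user declaring that, for tonight's choice, ephemeral content needs outrank long-term taste.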
While a traditional CF-based recommender typically requires users to provide explicit feedback, a social data mining system attempts to mine the social activity records of a community of users to implicitly extract the importance of individuals and documents. Such activity may include Usenet messages, system usage history, citations, or hyperlinks. TopicShop (Amento et al., 2003) is an information workspace that allows groups of common Web sites to be explored, organized into user-defined collections, manipulated to extract and order common features, and annotated by one or more users. These actions on their own may not be of large interest, but the collection of these actions can be mined by TopicShop and redistributed to other users to suggest sites of general and personal interest. Agrawal et al. (2003) explored the threads of newsgroups to identify the relationships between community members. Interestingly, they concluded that, due to the nature of newsgroup postings, users are more likely to respond to those with whom they disagree; links between users are therefore more likely to suggest that users should be placed in differing partitions rather than the same partition. Although this technique has not been directly applied to the construction of recommendations, such an application seems a logical field of future study.
Although traditional recommenders suggest what item a user should consume, they have tended to ignore changes over time. Temporal recommenders apply data mining techniques to suggest when a recommendation should be made or when a user should consume an item. Adomavicius and Tuzhilin (2001) suggest the construction of a recommendation warehouse, which stores ratings in a hypercube. This multidimensional structure can store data on not only the traditional user and item axes but also on additional profile dimensions such as time. Through this approach, queries can be expanded from the traditional "what items should we suggest to user X?" to "at what times would user X be most receptive to recommendations for product Y?" Hamlet (Etzioni et al., 2003) is designed to minimize the purchase price of airplane tickets. Hamlet combines the results from time series analysis, Q-learning, and the Ripper algorithm to create a multi-strategy data-mining algorithm. By watching for trends in airline pricing and suggesting when a ticket should be purchased, Hamlet was able to save the average user 23.8% when savings was possible.
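A minimal sketch of the hypercube idea follows, assuming a three-dimensional user x item x time cube held as a plain Python dictionary; the aggregation mirrors the "at what times would user X be most receptive" query. The cube contents and helper names are invented for illustration, not the recommendation-warehouse design itself.

```python
from collections import defaultdict

# cube[(user, item, time_slot)] = rating; a toy stand-in for a ratings hypercube.
cube = {
    ("x", "product_y", "morning"): 4, ("x", "product_y", "evening"): 9,
    ("x", "product_z", "evening"): 8, ("w", "product_y", "morning"): 7,
}

def receptive_times(cube, user):
    """Aggregate ratings along the time dimension for one user (a roll-up)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for (u, _item, slot), rating in cube.items():
        if u == user:
            totals[slot] += rating
            counts[slot] += 1
    return sorted(((totals[s] / counts[s], s) for s in totals), reverse=True)

print(receptive_times(cube, "x"))  # [(8.5, 'evening'), (4.0, 'morning')]
```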
CONCLUSION

Recommender systems have emerged as powerful tools for helping users find and evaluate items of interest. These systems use a variety of techniques to help users identify the items that best fit their tastes or needs. While popular CF-based algorithms continue to produce meaningful, personalized results in a variety of domains, data mining techniques are increasingly being used both in hybrid systems, to improve recommendations in previously successful applications, and in stand-alone recommenders, to produce accurate recommendations in previously challenging domains. The use of data mining algorithms has also changed the types of recommendations as applications move from recommending what to consume to also recommending when to consume. While recommender systems may have started as largely a passing novelty, they have clearly become a real and powerful tool in a variety of applications, and data mining algorithms can be, and will continue to be, an important part of the recommendation process.

REFERENCES

Adomavicius, G., & Tuzhilin, A. (2001). Extending recommender systems: A multidimensional approach. IJCAI-01 Workshop on Intelligent Techniques for Web Personalization (ITWP2001), Seattle, Washington.

Agrawal, R., Rajagopalan, S., Srikant, R., & Xu, Y. (2003). Mining newsgroups using networks arising from social behavior. In Proceedings of the Twelfth World Wide Web Conference (WWW12) (pp. 529-535), Budapest, Hungary.

Amento, B., Terveen, L., Hill, W., Hix, D., & Schulman, R. (2003). Experiments in social data mining: The TopicShop system. ACM Transactions on Computer-Human Interaction, 10(1), 54-85.

Breese, J., Heckerman, D., & Kadie, C. (1998). Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI-98) (pp. 43-52), Madison, Wisconsin.

Etzioni, O., Knoblock, C.A., Tuchinda, R., & Yates, A. (2003). To buy or not to buy: Mining airfare data to minimize ticket purchase price. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 119-128), Washington, D.C.

Géry, M., & Haddad, H. (2003). Evaluation of Web usage mining approaches for user's next request prediction. In Fifth International Workshop on Web Information and Data Management (pp. 74-81), Madison, Wisconsin.

Geyer-Schulz, A., & Hahsler, M. (2002). Evaluation of recommender algorithms for an Internet information broker based on simple association rules and on the repeat-buying theory. In Fourth WEBKDD Workshop: Web Mining for Usage Patterns & User Profiles (pp. 100-114), Edmonton, Alberta, Canada.

Good, N. et al. (1999). Combining collaborative filtering with personal agents for better recommendations. In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99) (pp. 439-446), Orlando, Florida.

Herlocker, J., Konstan, J.A., Borchers, A., & Riedl, J. (1999). An algorithmic framework for performing collaborative filtering. In Proceedings of the 1999 Conference on Research and Development in Information Retrieval (pp. 230-237), Berkeley, California.
Lawrence, R.D. et al. (2001). Personalization of supermarket product recommendations. Data Mining and Knowledge Discovery, 5(1/2), 11-32.

Lin, W., Alvarez, S.A., & Ruiz, C. (2002). Efficient adaptive-support association rule mining for recommender systems. Data Mining and Knowledge Discovery, 6(1), 83-105.

Resnick, P., & Varian, H.R. (1997). Recommender systems. Communications of the ACM, 40(3), 56-89.

Sarwar, B., Karypis, G., Konstan, J.A., & Riedl, J. (2001). Item-based collaborative filtering recommendation algorithms. In Proceedings of the Tenth International Conference on World Wide Web (pp. 285-295), Hong Kong.

Schafer, J.B., Konstan, J.A., & Riedl, J. (2001). E-commerce recommendation applications. Data Mining and Knowledge Discovery, 5(1/2), 115-153.

Schafer, J.B., Konstan, J.A., & Riedl, J. (2002). Meta-recommendation systems: User-controlled integration of diverse recommendations. In Proceedings of the Eleventh Conference on Information and Knowledge Management (CIKM-02) (pp. 196-203), McLean, Virginia.

Shoemaker, C., & Ruiz, C. (2003). Association rule mining algorithms for set-valued data. Lecture Notes in Computer Science, 2690, 669-676.

Wolf, J., Aggarwal, C., Wu, K-L., & Yu, P. (1999). Horting hatches an egg: A new graph-theoretic approach to collaborative filtering. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 201-212), San Diego, CA.

KEY TERMS

Association Rules: Used to associate items in a database sharing some relationship (e.g., co-purchase information). Often take the form "if this, then that," such as, "If the customer buys a handheld videogame, then the customer is likely to purchase batteries."

Collaborative Filtering: Selecting content based on the preferences of people with similar interests.

Meta-Recommenders: Provide users with personalized control over the generation of a single recommendation list formed from the combination of rich recommendation data from multiple information sources and recommendation techniques.

Nearest-Neighbor Algorithm: A recommendation algorithm that calculates the distance between users based on the degree of correlation between scores in the users' preference histories. Predictions of how much a user will like an item are computed by taking the weighted average of the opinions of a set of nearest neighbors for that item (see the sketch following this list).

Recommender Systems: Any system that provides a recommendation, prediction, opinion, or user-configured list of items that assists the user in evaluating items.

Social Data Mining: Analysis and redistribution of information from records of social activity such as newsgroup postings, hyperlinks, or system usage history.

Temporal Recommenders: Recommenders that incorporate time into the recommendation process. Time can be either an input to the recommendation function or the output of the function.
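As promised in the Nearest-Neighbor Algorithm entry above, here is a minimal sketch of that prediction step: Pearson correlation between preference histories supplies the neighbor weights, and the prediction is the correlation-weighted average of the neighbors' ratings. The tiny ratings matrix and function names are illustrative only.

```python
from math import sqrt

def pearson(a, b):
    """Correlation between two users over the items they both rated."""
    common = [i for i in a if i in b]
    if len(common) < 2:
        return 0.0
    ma = sum(a[i] for i in common) / len(common)
    mb = sum(b[i] for i in common) / len(common)
    num = sum((a[i] - ma) * (b[i] - mb) for i in common)
    den = sqrt(sum((a[i] - ma) ** 2 for i in common)) * \
          sqrt(sum((b[i] - mb) ** 2 for i in common))
    return num / den if den else 0.0

def predict(ratings, user, item):
    """Average of neighbors' ratings, weighted by positive correlation only
    (a simplification; deployed systems typically also mean-center)."""
    num = den = 0.0
    for u, r in ratings.items():
        if u == user or item not in r:
            continue
        w = pearson(ratings[user], r)
        if w > 0:
            num += w * r[item]
            den += w
    return num / den if den else None

ratings = {"ann": {"m1": 5, "m2": 3}, "bob": {"m1": 4, "m2": 2, "m3": 4},
           "eve": {"m1": 1, "m2": 5, "m3": 2}}
print(predict(ratings, "ann", "m3"))  # 4.0, driven by the like-minded neighbor
```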
Approximate Range Queries by Histograms in OLAP
Francesco Buccafurri
University Mediterranea of Reggio Calabria, Italy

Gianluca Lax
University Mediterranea of Reggio Calabria, Italy

INTRODUCTION

Online analytical processing applications typically analyze a large amount of data by means of repetitive queries involving aggregate measures on such data. In fast OLAP applications, it is often advantageous to provide approximate answers to queries in order to achieve very high performance. A way to obtain this goal is by submitting queries on compressed data in place of the original ones. Histograms, initially introduced in the field of query optimization, represent one of the most important techniques used in the context of OLAP for producing approximate query answers.

BACKGROUND

Computing aggregate information is a widely exploited task in many OLAP applications. Every time it is necessary to produce fast query answers and a certain estimation error can be accepted, it is possible to query summary data rather than the original ones and to perform suitable interpolations. The typical OLAP query is the range query.

The range query estimation problem in the one-dimensional case can be stated as follows: given an attribute X of a relation R and a range I belonging to the domain of X, estimate the number of records of R with value of X lying in I. The challenge is finding methods for achieving a small estimation error while consuming a fixed amount of storage space.

A possible solution to this problem is using sampling methods: only a small number of suitably selected records of R, representing R well, are stored. The range query is then evaluated by exploiting this sample instead of the full relation R. Recently, Wu, Agrawal, and Abbadi (2002) have shown that, in terms of accuracy, sampling techniques based on the cumulative distribution function are definitely better than the methods based on tuple sampling (Chaudhuri, Das & Narasayya, 2001; Ganti, Lee & Ramakrishnan, 2000). The main advantage of sampling techniques is that they are very easy to implement.

Besides sampling, regression techniques try to model data as a function in such a way that only a small set of coefficients representing such a function is stored, rather than the original data. The simplest regression technique is the linear one, which models a data distribution as a linear function. Despite its simplicity, which prevents it from capturing complex relationships among data, this technique often produces acceptable results. There are also nonlinear regressions, significantly more complex than the linear one from the computational point of view, but applicable to a much larger set of cases.

Another possibility for facing the range query estimation problem consists of using wavelet-based techniques (Chakrabarti, Garofalakis, Rastogi & Shim, 2001; Garofalakis & Gibbons, 2002; Garofalakis & Kumar, 2004). Wavelets are mathematical transformations storing data in a compact and hierarchical fashion, used in many application contexts, like image and signal processing (Kacha, Grenez, De Doncker & Benmahammed, 2003; Khalifa, 2003). There are several types of transformations, each belonging to a family of wavelets. The result of each transformation is a set of values, called wavelet coefficients. The advantage of this technique is that, typically, the values of a (possibly large) number of wavelet coefficients fall below a fixed threshold, so that such coefficients can be approximated by 0. Clearly, the overall approximation of the technique, as well as the compression ratio, depends on the value of such a threshold. In recent years, wavelets have been exploited in data mining and knowledge discovery in databases, thanks to the time and space efficiency and the hierarchical data decomposition characterizing them. For a deeper treatment of wavelets, see Li, Li, Zhu, and Ogihara (2002).

Besides sampling and wavelets, histograms are widely used for estimating range queries. Although wavelets are sometimes viewed as a particular class of histograms, we prefer to describe histograms separately.
MAIN THRUST

Histograms are a lossy compression technique widely applied in various application contexts, like query optimization, statistical and temporal databases, and OLAP applications. In OLAP, compression allows us to obtain fast approximate answers by evaluating queries on reduced data in place of the original ones. Histograms are well suited to this purpose, especially in the case of range queries.

A histogram is a compact representation of a relation R. It is obtained by partitioning an attribute X of the relation R into k subranges, called buckets, and by maintaining for each of them a few pieces of information, typically corresponding to the bucket boundaries, the number of tuples with value of X belonging to the subrange associated with the bucket (often called the sum of the bucket), and the number of distinct values of X of such a subrange occurring in some tuple of R (i.e., the number of non-null frequencies of the subrange).

Recall that a range query, defined on an interval I of X, evaluates the number of occurrences in R with value of X in I. Thus, buckets embed a set of precomputed disjoint range queries capable of covering the whole active domain of X in R (here, active means attribute values actually appearing in R). As a consequence, the histogram, in general, does not give the possibility of evaluating exactly a range query that does not correspond to one of the precomputed embedded queries. In other words, while the contribution to the answer coming from the subranges coinciding with entire buckets can be returned exactly, the contribution coming from the subranges that partially overlap buckets can only be estimated, since the actual data distribution is not available.

Constructing the best histogram thus may mean defining the boundaries of the buckets in such a way that the estimation of the non-precomputed range queries becomes more effective (e.g., by avoiding large frequency differences inside a bucket). This approach corresponds to finding, among all possible sets of precomputed range queries, the set that guarantees the best estimation of the other (non-precomputed) queries, once a technique for estimating such queries is defined. Besides this problem, which we call the partition problem, there is another relevant issue to investigate: how to improve the estimation inside the buckets. We discuss both of these issues in the following two sections.

The Partition Problem

This issue has been widely analyzed in the past, and a number of techniques have been proposed. Among these, we first consider the Max-Diff histogram and the V-Optimal histogram. Even though they are not the most recent techniques, we cite them since they are still considered points of reference.

We start by describing the Max-Diff histogram. Let V = {v_1, ..., v_n} be the set of values of the attribute X actually appearing in the relation R, and let f(v_i) be the number of tuples of R having value v_i in X. A MaxDiff histogram with h buckets is obtained by putting a boundary between two adjacent attribute values v_i and v_{i+1} of V if the difference between f(v_{i+1}) · s_{i+1} and f(v_i) · s_i is one of the h-1 largest such differences (where s_i denotes the spread of v_i, that is, the distance from v_i to the next non-null value).
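A small sketch of the MaxDiff rule just stated: compute the area f(v_i) · s_i for each value, then cut where adjacent areas differ most. The data, helper names, and the convention of taking the last spread as 1 are illustrative assumptions.

```python
def maxdiff_boundaries(values, freq, h):
    """Place bucket boundaries at the h-1 largest adjacent differences
    of area = frequency * spread (spread of the last value taken as 1)."""
    spread = [values[i + 1] - values[i] for i in range(len(values) - 1)] + [1]
    area = [f * s for f, s in zip(freq, spread)]
    diffs = [(abs(area[i + 1] - area[i]), i) for i in range(len(area) - 1)]
    cuts = sorted(i for _, i in sorted(diffs, reverse=True)[: h - 1])
    # Split the sorted value list after each chosen position.
    buckets, start = [], 0
    for c in cuts:
        buckets.append(values[start : c + 1])
        start = c + 1
    buckets.append(values[start:])
    return buckets

# Non-null attribute values and their frequencies f(v_i).
print(maxdiff_boundaries([1, 2, 3, 4, 5], [5, 6, 5, 50, 52], h=2))
# -> [[1, 2, 3], [4, 5]]: the cut lands where the area jumps from 5 to 50
```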
A V-Optimal histogram, the other classical histogram we describe, produces more precise results than the Max-Diff histogram. It is obtained by selecting the boundaries of each bucket i so that the sum over all buckets of SSE_i is minimal, where SSE_i is the standard squared error of the i-th bucket. The V-Optimal histogram uses a dynamic programming technique in order to find the optimal partitioning with respect to a given error metric. Even though the V-Optimal histogram is more accurate than Max-Diff, its high space and time complexity makes it rarely used in practice.

In order to overcome such a drawback, an approximate version of the V-Optimal histogram has been proposed. The basic idea is quite simple. First, the data are partitioned into l disjoint chunks, and then the V-Optimal algorithm is used in order to compute a histogram within each chunk. The consequent problem is how to allocate buckets to the chunks such that exactly B buckets are used. This is solved by implementing a dynamic programming scheme. It is shown that an approximate V-Optimal histogram with B + l buckets has the same accuracy as the non-approximate V-Optimal with B buckets. Moreover, the time required for executing the approximate algorithm is reduced by a multiplicative factor equal to 1/l.
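For concreteness, a hedged sketch of the dynamic program such a V-Optimal construction can use: opt[j][k] holds the minimal total SSE of splitting the first j frequencies into k buckets, and prefix sums give each candidate bucket's SSE in constant time. The variable names are mine, not from the cited papers.

```python
def v_optimal(freqs, num_buckets):
    """Minimal sum of per-bucket squared errors over all partitions."""
    n = len(freqs)
    pre = [0.0]; pre2 = [0.0]          # prefix sums of f and f^2
    for f in freqs:
        pre.append(pre[-1] + f)
        pre2.append(pre2[-1] + f * f)

    def sse(i, j):                     # SSE of bucket freqs[i:j]
        s, s2, m = pre[j] - pre[i], pre2[j] - pre2[i], j - i
        return s2 - s * s / m          # equals sum((f - mean)^2)

    INF = float("inf")
    opt = [[INF] * (num_buckets + 1) for _ in range(n + 1)]
    opt[0][0] = 0.0
    for j in range(1, n + 1):
        for k in range(1, min(j, num_buckets) + 1):
            opt[j][k] = min(opt[i][k - 1] + sse(i, j) for i in range(k - 1, j))
    return opt[n][num_buckets]

print(v_optimal([5, 6, 5, 50, 52], 2))  # ~2.67: buckets {5,6,5} and {50,52}
```

The O(n^2 · B) cost of this table is exactly the expense that the chunked approximation described above reduces by a factor of 1/l.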
We call the histograms described so far classical histograms.

Besides accuracy, new histograms tend to satisfy other properties in order to allow their application to new environments (e.g., knowledge discovery). In particular, (1) the histogram should maintain, to a certain extent, the semantic nature of the original data, in such a way that meaningful queries for mining activities can be submitted to the reduced data in place of the original ones; (2) for a given kind of query, the accuracy of the reduced structure should be guaranteed; and (3) the histogram should efficiently support hierarchical range queries, in order not to limit too much the capability of drilling down and rolling up over the data.
Classical histograms lack this last property, since they are flat structures. Many proposals have been presented in order to guarantee the three properties previously described, and we report some of them in the following.

Requirement (3) was introduced by Koudas, Muthukrishnan, and Srivastava (2000), who showed the insufficient accuracy of classical histograms in evaluating hierarchical range queries. Therein, a polynomial-time algorithm for constructing optimal histograms with respect to hierarchical queries is proposed.

The selectivity estimation problem for non-hierarchical range queries was studied by Gilbert, Kotidis, Muthukrishnan, and Strauss (2001), where, in line with property (2), optimal and approximate polynomial (in the database size) algorithms with a provable approximation guarantee for constructing histograms are also presented.

Guha, Koudas, and Srivastava (2002) have proposed efficient algorithms for the problem of approximating the distribution of measure attributes organized into hierarchies. Such algorithms are based on dynamic programming and on a notion of sparse intervals.

Algorithms returning both optimal and suboptimal solutions for approximating range queries by histograms, and for their dynamic maintenance under additive changes, are provided by Muthukrishnan and Strauss (2003). The best algorithm (with respect to construction time) returning an optimal solution takes polynomial time.

Buccafurri and Lax (2003) have presented a histogram based on a hierarchical decomposition of the data distribution kept in a full binary tree. Such a tree, containing a set of precomputed hierarchical queries, is encoded by using bit saving in order to obtain a smaller structure and, thus, to efficiently support hierarchical range queries.

Besides bucket-based histograms, there are other kinds of histograms whose construction is not driven by the search for a suitable partition of the attribute domain and whose structure is more complex than simply a set of buckets. This class of histograms is called non-bucket based histograms. Wavelets are an example of this kind of histogram.

In the next section, we deal with the second problem introduced earlier, concerning the estimation of range queries partially involving buckets.

Estimation Inside a Bucket

While finding the optimal bucket partition has been widely investigated in past years, the problem of estimating queries partially involving a bucket has received little attention.

Histograms are well suited to range query evaluation, since buckets basically correspond to a set of precomputed range queries. A range query that entirely involves one or more buckets can be computed exactly, while if it partially overlaps a bucket, the result can only be estimated.

The simplest adopted estimation technique is the Continuous Value Assumption (CVA). Given a bucket of size s and sum c, a range query overlapping the bucket in i points is estimated as (i/s) · c. This corresponds to estimating the partial contribution of the bucket to the range query result by linear interpolation.

Another possibility is to use the Uniform Spread Assumption (USA). It assumes that values are distributed at an equal distance from each other and that the overall frequency sum is equally distributed among them. In this case, it is necessary to know the number of non-null frequencies belonging to the bucket. Denoting by t such a value, the range query is estimated by

$$\frac{\big((s-1) + (i-1)(t-1)\big)\cdot c}{(s-1)\, t}.$$
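A sketch contrasting the two intra-bucket estimators under the definitions above (s = bucket size, c = bucket sum, t = non-null values, i = overlapped points); the example numbers are invented.

```python
def cva_estimate(i, s, c):
    """Continuous Value Assumption: linear interpolation over the bucket."""
    return (i / s) * c

def usa_estimate(i, s, c, t):
    """Uniform Spread Assumption: t non-null values at equal spread."""
    return ((s - 1) + (i - 1) * (t - 1)) * c / ((s - 1) * t)

# Bucket covering s = 100 points, sum c = 400, t = 20 non-null values;
# a query overlaps the first i = 25 points of the bucket.
print(cva_estimate(25, 100, 400))      # 100.0
print(usa_estimate(25, 100, 400, 20))  # about 112.12
```

Both estimators return c when i = s (the whole bucket), as they must; they differ only in how the sum is spread over a partial overlap.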
An interesting problem is understanding whether, by exploiting the information typically contained in histogram buckets and possibly by adding some concise summary information, the frequency estimation inside buckets, and thus the histogram accuracy, can be improved. To this aim, starting from a theoretical analysis of the limits of CVA and USA, Buccafurri, Pontieri, Rosaci, and Saccà (2002) have proposed using an additional storage space of 32 bits, called the 4LT, in each bucket in order to store an approximate representation of the data distribution inside the bucket. In particular, the 4LT is used to save approximate cumulative frequencies at seven equidistant intervals internal to the bucket.

Clearly, approaches similar to that followed in Buccafurri, Pontieri, Rosaci, and Saccà (2002) have to deal with the trade-off between the extra storage space required for each bucket and the total number of buckets that the allowed overall storage space permits.

FUTURE TRENDS

Data streams are an emergent issue that in the last two years has captured the interest of many scientific communities. The crucial problem arising in several application contexts, like network monitoring, sensor networks, financial applications, security, telecommunication data management, Web applications, and so on, is dealing with continuous data flows (i.e., data streams) having the following characteristics: (1) they are time dependent; (2) their size is very large, so that they cannot be stored in their entirety due to actual memory limitations; and (3) data arrival is very fast and unpredictable, so that each data management operation must be very efficient.
Since a data stream consists of a large amount of data, it is usually managed on the basis of a sliding window, including only the most recent data (Babcock, Babu, Datar, Motwani & Widom, 2002). Thus, any technique capable of compressing sliding windows while maintaining a good approximate representation of the data distribution is certainly relevant in this field. Typical queries performed on sliding windows are similarity queries and other analyses, like change mining queries (Dong, Han, Lakshmanan, Pei, Wang & Yu, 2003), useful for trend analysis and, in general, for understanding the dynamics of data. Also in this field, histograms may become an important analysis tool. The challenge is finding new histograms that (1) are fast to construct and to maintain, that is, whose required updating operations (performed at each data arrival) are very efficient; (2) maintain good accuracy in approximating the data distribution; and (3) support continuous querying on the data.

An example of the above emerging approaches is reported in Buccafurri and Lax (2004), where a tree-like histogram with cyclic updating is proposed. By using such a compact structure, many mining techniques, whose computational cost would be very high if applied to raw data streams, can be implemented effectively.
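To illustrate requirement (1), here is a minimal sketch of maintaining an equi-width histogram over a sliding window, where each arrival triggers O(1) counter updates. This is a generic construction for illustration, not the cyclic tree-like histogram of Buccafurri and Lax (2004).

```python
from collections import deque

class SlidingHistogram:
    """Equi-width histogram over the last `window` stream values."""
    def __init__(self, lo, hi, buckets, window):
        self.lo, self.width = lo, (hi - lo) / buckets
        self.counts = [0] * buckets
        self.recent = deque()
        self.window = window

    def _bucket(self, v):
        return min(int((v - self.lo) / self.width), len(self.counts) - 1)

    def add(self, v):                      # O(1) work per arriving value
        self.recent.append(v)
        self.counts[self._bucket(v)] += 1
        if len(self.recent) > self.window: # expire the oldest value
            self.counts[self._bucket(self.recent.popleft())] -= 1

h = SlidingHistogram(lo=0, hi=100, buckets=4, window=5)
for v in [3, 97, 45, 51, 60, 2, 80]:
    h.add(v)
print(h.counts)  # [1, 1, 2, 1]: counts over the last 5 values only
```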
CONCLUSION

Data reduction represents an important task both in data mining and in OLAP, since it allows us to represent very large amounts of data in a compact structure on which mining techniques or OLAP queries can be performed efficiently. The time and memory advantages arising from data compression, provided that a sufficient degree of accuracy is guaranteed, may considerably improve the capabilities of mining and OLAP tools.

This opportunity (added to the necessity, coming from emergent research fields such as data streams) of producing more and more compact representations of data explains the attention that the research community is giving to techniques like histograms and wavelets, which provide a concrete answer to the previous requirements.

REFERENCES

Babcock, B., Babu, S., Datar, M., Motwani, R., & Widom, J. (2002). Models and issues in data stream systems. Proceedings of the ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems.

Buccafurri, F., & Lax, G. (2003). Pre-computing approximate hierarchical range queries in a tree-like histogram. Proceedings of the International Conference on Data Warehousing and Knowledge Discovery.

Buccafurri, F., & Lax, G. (2004). Reducing data stream sliding windows by cyclic tree-like histograms. Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases.

Buccafurri, F., Pontieri, L., Rosaci, D., & Saccà, D. (2002). Improving range query estimation on histograms. Proceedings of the International Conference on Data Engineering.

Chakrabarti, K., Garofalakis, M., Rastogi, R., & Shim, K. (2001). Approximate query processing using wavelets. VLDB Journal, The International Journal on Very Large Data Bases, 10(2-3), 199-223.

Chaudhuri, S., Das, G., & Narasayya, V. (2001). A robust, optimization-based approach for approximate answering of aggregate queries. Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data.

Dong, G. et al. (2003). Online mining of changes from data streams: Research problems and preliminary results. Proceedings of the ACM SIGMOD Workshop on Management and Processing of Data Streams.

Ganti, V., Lee, M.L., & Ramakrishnan, R. (2000). Icicles: Self-tuning samples for approximate query answering. Proceedings of the 26th International Conference on Very Large Data Bases.

Garofalakis, M., & Gibbons, P.B. (2002). Wavelet synopses with error guarantees. Proceedings of the ACM SIGMOD International Conference on Management of Data.

Garofalakis, M., & Kumar, A. (2004). Deterministic wavelet thresholding for maximum error metrics. Proceedings of the Twenty-third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems.

Gilbert, A.C., Kotidis, Y., Muthukrishnan, S., & Strauss, M.J. (2001). Optimal and approximate computation of summary statistics for range aggregates. Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems.

Guha, S., Koudas, N., & Srivastava, D. (2002). Fast algorithms for hierarchical range histogram construction. Proceedings of the Twenty-first ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems.
Kacha, A., Grenez, F., De Doncker, P., & Benmahammed, K. (2003). A wavelet-based approach for frequency estimation of interference signals in printed circuit boards. Proceedings of the 1st International Symposium on Information and Communication Technologies.

Khalifa, O. (2003). Image data compression in wavelet transform domain using modified LBG algorithm. Proceedings of the 1st International Symposium on Information and Communication Technologies.

Koudas, N., Muthukrishnan, S., & Srivastava, D. (2000). Optimal histograms for hierarchical range queries (extended abstract). Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems.

Li, T., Li, Q., Zhu, S., & Ogihara, M. (2002). Survey on wavelet applications in data mining. ACM SIGKDD Explorations, 4(2), 49-68.

Muthukrishnan, S., & Strauss, M. (2003). Rangesum histograms. Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms.

Wu, Y., Agrawal, D., & Abbadi, A.E. (2002). Query estimation by adaptive sampling. Proceedings of the International Conference on Data Engineering.

KEY TERMS

Bucket: An element obtained by partitioning the domain of an attribute X of a relation into non-overlapping intervals. Each bucket consists of a tuple <inf, sup, val>, where val is an aggregate piece of information (i.e., sum, average, count, etc.) about the tuples with value of X belonging to the interval (inf, sup).

Bucket-Based Histogram: A type of histogram whose construction is driven by the search for a suitable partition of the attribute domain into buckets.

Continuous Value Assumption (CVA): A technique allowing us to estimate values inside a bucket by linear interpolation.

Data Preprocessing: The application of several methods preceding the mining phase, done to improve the overall data mining results. Usually, it consists of (1) data cleaning, a method for fixing missing values, outliers, and possible inconsistent data; (2) data integration, the union of (possibly heterogeneous) data coming from different sources into a unique data store; and (3) data reduction, the application of any technique working on the data representation that is capable of saving storage space without compromising the possibility of querying the data.

Histogram: A set of buckets implementing a partition of the overall domain of a relation attribute.

Range Query: A query returning aggregate information (i.e., sum, average) about data belonging to a given interval of the domain.

Uniform Spread Assumption (USA): A technique for estimating values inside a bucket by assuming that values are distributed at an equal distance from each other and that the overall frequency sum is distributed equally among them.

Wavelets: Mathematical transformations implementing hierarchical decomposition of functions, leading to the representation of functions through sets of wavelet coefficients.
Artificial Neural Networks for Prediction


Rafael Martí
Universitat de València, Spain

INTRODUCTION

The design and implementation of intelligent systems with human capabilities is the starting point for designing Artificial Neural Networks (ANNs). The original idea takes after neuroscience theory on how neurons in the human brain cooperate to learn from a set of input signals to produce an answer. Because the power of the brain comes from the number of neurons and the multiple connections between them, the basic idea is that connecting a large number of simple elements in a specific way can form an intelligent system.

Generally speaking, an ANN is a network of many simple processors called units, linked to certain neighbors with varying coefficients of connectivity (called weights) that represent the strength of these connections. The basic unit of ANNs, called an artificial neuron, simulates the basic functions of natural neurons: it receives inputs, processes them by simple combination and threshold operations, and outputs a final result. ANNs often employ supervised learning, in which training data (including both the input and the desired output) are provided. Learning basically refers to the process of adjusting the weights to optimize the network performance. ANNs belong to the machine-learning algorithms because the changing of a network's connection weights causes it to gain knowledge in order to solve the problem at hand.

Neural networks have been widely used for both classification and prediction. In this article, I focus on the prediction or estimation problem (although, with a few changes, my comments and descriptions also apply to classification). Estimating and forecasting future conditions are involved in different business activities. Some examples include cost estimation, prediction of product demand, and financial planning. Moreover, the field of prediction also covers other activities, such as medical diagnosis or industrial process modeling.

In this short article I focus on multilayer neural networks because they are the most common. I describe their architecture and some of the most popular training methods. Then I finish with some associated conclusions and the appropriate list of references to provide some pointers for further study.

BACKGROUND

From a technical point of view, ANNs offer a general framework for representing nonlinear mappings from several input variables to several output variables. They are built by tuning a set of parameters known as weights and can be considered as an extension of the many conventional mapping techniques. In classification or recognition problems, the net's outputs are categories, while in prediction or approximation problems, they are continuous variables. Although this article focuses on the prediction problem, most of the key issues in the net functionality are common to both.

In the process of training the net (supervised learning), the problem is to find the values of the weights w that minimize the error across a set of input/output pairs (patterns) called the training set E. For a single output and input vector x, the error measure is typically the root mean squared difference between the predicted output p(x,w) and the actual output value f(x) over all the elements x in E (RMSE); therefore, the training is an unconstrained nonlinear optimization problem, where the decision variables are the weights and the objective is to reduce the training error. Ideally, the set E is a representative sample of points in the domain of the function f being approximated; in practice, however, it is usually a set of points for which the f-value is known.

$$\min_{w}\; error(E,w) = \sqrt{\frac{\sum_{x \in E} \big(f(x) - p(x,w)\big)^2}{|E|}} \qquad (1)$$

The main goal in the design of an ANN is to obtain a model that makes good predictions for new inputs (i.e., to provide good generalization). Therefore, the net must represent the systematic aspects of the training data rather than their specific details. The standard way to measure the generalization provided by the net consists of introducing a second set of points in the domain of f called the testing set, T. Assume that no point in T belongs to E and that f(x) is known for all x in T. After the optimization has been performed and the weights have been set to minimize the error in E (w = w*), the error across the testing set T is computed (error(T,w*)).
The net must exhibit a good fit between the target f-values and the output (prediction) in the training set and also in the testing set. If the RMSE in T is significantly higher than the one in E, we say that the net has memorized the data instead of learning it (i.e., the net has overfitted the training data).

The optimization of the function given in (1) is a hard problem by itself. Moreover, keep in mind that the final objective is to obtain a set of weights that provides low values of error(T,w*) for any set T. In the following sections I summarize some of the most popular, and some not so popular but more efficient, methods to train the net (i.e., to compute appropriate weight values).

MAIN THRUST

Several models inspired by biological neural networks have been proposed throughout the years, beginning with the perceptron introduced by Rosenblatt (1962). He studied a simple architecture where the output of the net is a transformation of a linear combination of the input variables and the weights. Minsky and Papert (1969) showed that the perceptron can only solve linearly separable classification problems and is therefore of limited interest. A natural extension to overcome its limitations is given by the so-called multilayer perceptron or, simply, multilayer neural networks. I have considered this architecture with a single hidden layer. A schematic representation of the network appears in Figure 1.
Neural Network Architecture

Let NN = (N, A) be an ANN, where N is the set of nodes and A is the set of arcs. N is partitioned into three subsets: N_I, the input nodes; N_H, the hidden nodes; and N_O, the output nodes. I assume that n variables exist in the function that I want to predict or approximate; therefore |N_I| = n. The neural network has m hidden neurons (|N_H| = m), with a bias term in each hidden neuron and a single output neuron (we restrict our attention to real functions f: R^n -> R). Figure 1 shows a net where N_I = {1, 2, ..., n}, N_H = {n+1, n+2, ..., n+m}, and N_O = {s}.

Given an input pattern x = (x_1, ..., x_n), the neural network provides the user with an associated output NN(x,w), which is a function of the weights w. Each node i in the input layer receives a signal of amount x_i that it sends through all its incident arcs to the nodes in the hidden layer. Each node n+j in the hidden layer receives a signal input(n+j) according to the expression

$$input(n+j) = w_{n+j} + \sum_{i=1}^{n} x_i \, w_{i,n+j},$$

where w_{n+j} is the bias value for node n+j, and w_{i,n+j} is the weight on the arc from node i in the input layer to node n+j in the hidden layer. Each hidden node transforms its input by means of a nonlinear activation function: output(j) = sig(input(j)). The most popular choice for the activation function is the sigmoid function sig(x) = 1/(1+e^{-x}). Laguna and Martí (2002) test two activation functions for the hidden neurons and conclude that the sigmoid presents superior performance. Each hidden node n+j sends the amount of signal output(n+j) through the arc (n+j,s). The node s in the output layer receives the weighted sum of the values coming from the hidden nodes. This sum, NN(x,w), is the net's output, according to the expression

$$NN(x,w) = w_s + \sum_{j=1}^{m} output(n+j) \, w_{n+j,s}.$$

In the process of training the net (supervised learning), the problem is to find the values of the weights (including the bias factors) that minimize the error (RMSE) across the training set E. After the optimization has been performed and the weights have been set (w = w*), the net is ready to produce the output for any input value. The testing error Error(T,w*) computes the Root Mean Squared Error across the elements of the testing set T = {y_1, y_2, ..., y_s}, none of which belongs to the training set E:

$$Error(T,w^*) = \sqrt{\frac{\sum_{i=1}^{s} \big(f(y_i) - p(y_i,w^*)\big)^2}{s}}.$$

Figure 1. Neural network diagram

Training Methods

Considering the supervised learning described in the previous section, many different training methods have been proposed. Here, I summarize some of the most relevant, starting with the well-known backpropagation
method. For a deeper understanding of them, see the excellent book by Bishop (1995).

Backpropagation (BP) was the first method for neural network training and is still the most widely used algorithm in practical applications. It is a gradient descent method that searches for the global optimum of the network weights. Each iteration consists of two steps. First, the partial derivatives dError/dw are computed for each weight in the net. Then the weights are modified to reduce the RMSE according to the direction given by the gradient. There have been different modifications to this basic procedure; the most significant is the addition of a momentum term to prevent zigzagging in the search.

Because the neural network training problem can be expressed as a nonlinear unconstrained optimization problem, one might use more elaborate nonlinear methods than gradient descent to solve it. A selection of the best-established algorithms in unconstrained nonlinear optimization has also been used in this context: specifically, the nonlinear simplex method, the direction set method, the conjugate gradient method, the Levenberg-Marquardt algorithm (Moré, 1978), and the GRG2 (Smith & Lasdon, 1992).
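A compact sketch of that update rule, gradient descent with a momentum term, applied to the squared-error objective behind (1); the toy single-weight model and constants are illustrative, not the article's experiments.

```python
def train_gd_momentum(data, grad, w, lr=0.05, beta=0.9, epochs=200):
    """Gradient descent with momentum: v accumulates past gradients,
    damping the zigzagging of plain steepest descent."""
    v = 0.0
    for _ in range(epochs):
        g = grad(data, w)          # dError/dw over the training set
        v = beta * v - lr * g
        w = w + v
    return w

# Toy single-weight model p(x, w) = w * x with squared-error gradient.
def grad(data, w):
    return sum(2 * (w * x - fx) * x for x, fx in data) / len(data)

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
print(train_gd_momentum(data, grad, w=0.0))  # converges near w = 2.0
```

In a real multilayer net, grad would be computed layer by layer with the chain rule, which is exactly what the backpropagation of the method's name refers to.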
Recently, metaheuristic methods have also been adapted to this problem. Specifically, on one hand there are methods based on local search procedures and, on the other, methods based on populations of solutions, known as evolutionary methods. In the first category, two methods have been applied, simulated annealing and tabu search, while in the second one finds the so-called genetic algorithms, scatter search, and, more recently, a path relinking implementation. Several studies (Sexton, 1998) have shown that tabu search outperforms the simulated annealing implementation; therefore, I first focus on the different tabu search implementations for ANN training.

Tabu Search

Tabu search (TS) is based on the premise that in order to qualify as intelligent, problem solving must incorporate adaptive memory and responsive exploration. The adaptive memory feature of TS allows the implementation of procedures that are capable of searching the solution space economically and effectively. Because local choices are guided by information collected during the search, TS contrasts with memoryless designs that rely heavily on semirandom processes that implement a form of sampling. The emphasis on responsive exploration in tabu search, whether in a deterministic or probabilistic implementation, derives from the supposition that a bad strategic choice can yield more information than a good random choice; in a system that uses memory, a bad choice based on strategy can provide useful clues about how the strategy may profitably be changed.

As far as I know, the first tabu search approach for neural network training is due to Sexton et al. (1998). A short description follows. An initial solution x_0 is randomly drawn from a uniform distribution in the range [-10,10]. Solutions are randomly generated in this range for a given number of iterations. When generating a new point x_new, aspiration-level and tabu conditions are checked. If f(x_new) < f(x_best), then the point is automatically accepted, and both x_best and f(x_best) are updated; otherwise, the tabu conditions are tested. If there is a solution x_i in the tabu list (TL) such that f(x_new) lies in [f(x_i) - 0.01·f(x_i), f(x_i) + 0.01·f(x_i)], then the complete test is applied to x_new and x_i; otherwise, the point is accepted. The test checks whether all the weights in x_new are within 0.01 of those in x_i; in this case the point is rejected; otherwise, the point is accepted, and x_new and f(x_new) are entered into TL. This process continues for 1,000 iterations of accepted solutions. Then another cycle of 1,000 iterations of random sampling begins. These cycles repeat continuously while f(x_best) improves.

Martí and El-Fallahi (2004) propose an improved tabu search method that consists of three phases: MultiRSimplex, TSProb, and TSFreq. After the initialization with the MultiRSimplex phase, the procedure performs iterations in a loop that alternates the TSProb and TSFreq phases, to intensify and diversify the search, respectively. In this work, a computational study of 12 methods for neural network training is presented, including nonlinear and local-search-based optimizers. Overall, experiments with 45 functions from the literature were performed to compare the procedures. The experiments show that some functions cannot be approximated with a reasonable accuracy level when training the net for a limited number of iterations. The experimentation also shows that the proposed TS provides, on average, the best solutions (best approximations).
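A simplified sketch of that sampling-based tabu search loop: candidate weight vectors are drawn at random, better-than-best points are always accepted (the aspiration criterion), and points too close to a tabu-list entry are rejected. The 0.01 tolerances and [-10,10] range follow the description above; everything else is an illustrative stand-in, not Sexton et al.'s code.

```python
import random

def tabu_search(f, dim, iters=2000, lo=-10.0, hi=10.0):
    best = [random.uniform(lo, hi) for _ in range(dim)]
    f_best, tabu = f(best), []
    for _ in range(iters):
        x = [random.uniform(lo, hi) for _ in range(dim)]
        fx = f(x)
        if fx < f_best:                      # aspiration: always take a new best
            best, f_best = x, fx
            tabu.append((x, fx))
            continue
        # Tabu test: reject points whose value and weights nearly repeat an entry.
        similar = any(abs(fx - ft) <= 0.01 * abs(ft) and
                      all(abs(a - b) <= 0.01 for a, b in zip(x, xt))
                      for xt, ft in tabu)
        if not similar:
            tabu.append((x, fx))
    return best, f_best

# Minimize a toy training-error surrogate over two "weights".
f = lambda w: (w[0] - 2) ** 2 + (w[1] + 1) ** 2
print(tabu_search(f, dim=2))  # best point drifts toward (2, -1)
```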
Evolutionary Methods

The idea of applying the biological principle of natural evolution to artificial systems, introduced more than three decades ago, has seen impressive growth in the past few years. Evolutionary algorithms have been successfully applied to numerous problems from different domains, including optimization, automatic programming, machine learning, economics, ecology, population genetics, studies of evolution and learning, and social systems.

A genetic algorithm is an iterative procedure that consists of a constant-size population of individuals,
each represented by a finite string of symbols, known as the genome, encoding a possible solution in a given problem space. This space, referred to as the search space, comprises all possible solutions to the problem at hand. Solutions to a problem were originally encoded as binary strings due to certain computational advantages associated with such an encoding. Also, the theory about the behavior of algorithms was based on binary strings. Because in many instances it is impractical to represent solutions by using binary strings, the solution representation has been extended in recent years to include character-based encoding, real-valued encoding, and tree representations.

The standard genetic algorithm proceeds as follows. An initial population of individuals is generated at random or heuristically. At every evolutionary step, known as a generation, the individuals in the current population are decoded and evaluated according to some predefined quality criterion, referred to as the fitness, or fitness function. To form a new population (the next generation), individuals are selected according to their fitness. Many selection procedures are currently in use, one of the simplest being Holland's original fitness-proportionate selection, where individuals are selected with a probability proportional to their relative fitness. This ensures that the expected number of times an individual is chosen is approximately proportional to its relative performance in the population. Thus, high-fitness (good) individuals stand a better chance of reproducing, while low-fitness ones are more likely to disappear.

In terms of ANN training, a solution (or individual) consists of an array with the net's weights, and its associated fitness is usually the RMSE obtained with this solution on the training set.
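A minimal genetic algorithm for this setting: individuals are real-valued weight vectors, fitness is the training RMSE to be minimized, and selection is fitness-proportionate as described above. The rates, population size, and helper names are illustrative assumptions.

```python
import random

def evolve(fitness, dim, pop_size=30, gens=100, sigma=0.5):
    """Each individual is a weight vector; lower fitness (RMSE) is better."""
    pop = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(gens):
        scores = [1.0 / (1e-9 + fitness(ind)) for ind in pop]  # invert to minimize
        def pick():                    # fitness-proportionate (roulette) selection
            r = random.uniform(0, sum(scores))
            for ind, sc in zip(pop, scores):
                r -= sc
                if r <= 0:
                    return ind
            return pop[-1]
        nxt = []
        for _ in range(pop_size):
            a, b = pick(), pick()
            cut = random.randrange(1, dim) if dim > 1 else 0  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [w + random.gauss(0, sigma) if random.random() < 0.1 else w
                     for w in child]                          # occasional mutation
            nxt.append(child)
        pop = nxt
    return min(pop, key=fitness)

rmse = lambda w: ((w[0] - 2) ** 2 + (w[1] + 1) ** 2) ** 0.5  # toy error surface
print(evolve(rmse, dim=2))  # a weight vector near (2, -1)
```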
A great deal of research exists on GA implementations for ANNs. Consider, for instance, the recent work by Alba and Chicano (2004), in which a hybrid GA is proposed. Here, the hybridization refers to the inclusion of problem-dependent knowledge in a general search template. The hybrid algorithms used in this work are combinations of two algorithms (weak hybridization), where one of them acts as an operator in the other. This kind of combination has produced the most successful training methods in the last few years. The authors propose the combination of GA with the BP algorithm, as well as GA with the Levenberg-Marquardt method, for training ANNs.

Scatter search (SS) was first introduced in Glover (1977) as a heuristic for integer programming. The following template is a standard for implementing scatter search and consists of five methods: a diversification generation method to generate a collection of diverse trial solutions; an improvement method to transform a trial solution into one or more enhanced trial solutions; a reference set update method to build and maintain a reference set consisting of the b best solutions found; a subset generation method to operate on the reference set in order to produce a subset of its solutions as a basis for creating combined solutions; and a solution combination method to transform a given subset of solutions produced by the subset generation method into one or more combined solution vectors. An exhaustive description of these methods and how they operate can be found in Laguna and Martí (2003).

Laguna and Martí (2002) proposed a three-step scatter search algorithm for ANNs. El-Fallahi, Martí, and Lasdon (in press) propose a new training method based on the path relinking methodology. Path relinking starts from a given set of elite solutions obtained during a previous search process. Path relinking and its cousin, scatter search, are mainly based on two elements: combinations and local search. Path relinking generalizes the concept of combination beyond its usual application to consider paths between solutions. Local search, performed now with the GRG2 optimizer, intensifies the search by seeking local optima. The paper shows an empirical comparison of the proposed method with the best previous evolutionary approaches, and the associated experiments show the superiority of the new method in terms of solution quality (prediction accuracy). On the other hand, these experiments confirm again that a few functions cannot be approximated with any of the current training methods.

FUTURE TRENDS

An open problem in the context of prediction is to compare ANNs with some modern approximation techniques developed in statistics. Specifically, nonparametric additive models and local regression can also offer good solutions to the general approximation or prediction problem. The development of hybrid systems from both technologies could provide the starting point for a new generation of prediction systems.

CONCLUSION

In this work I revise the most representative methods for neural network training. Several computational studies with some of these methods reveal that the best results are achieved with a combination of a metaheuristic procedure and a nonlinear optimizer. These experiments also show that, from a practical point of view, some functions cannot be approximated.
REFERENCES

Alba, E., & Chicano, J.F. (2004). Training neural networks with GA hybrid algorithms. In K. Deb (Ed.), Proceedings of the Genetic and Evolutionary Computation Conference, USA.

Bishop, C.M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.

El-Fallahi, A., Martí, R., & Lasdon, L. (in press). Path relinking and GRG for artificial neural networks. European Journal of Operational Research.

Glover, F. (1977). Heuristics for integer programming using surrogate constraints. Decision Sciences, 8, 156-166.

Glover, F., & Laguna, M. (1993). Tabu search. In C. Reeves (Ed.), Heuristic techniques for combinatorial problems (pp. 70-150).

Laguna, M., & Martí, R. (2002). Neural network prediction in a system for optimizing simulations. IIE Transactions, 34(3), 273-282.

Laguna, M., & Martí, R. (2003). Scatter search: Methodology and implementations in C. Kluwer Academic.

Martí, R., & El-Fallahi, A. (2004). Multilayer neural networks: An experimental evaluation of on-line training methods. Computers and Operations Research, 31, 1491-1513.

Martí, R., Laguna, M., & Glover, F. (in press). Principles of scatter search. European Journal of Operational Research.

Minsky, M.L., & Papert, S.A. (1969). Perceptrons (Expanded ed.). Cambridge, MA: MIT Press.

Moré, J.J. (1978). The Levenberg-Marquardt algorithm: Implementation and theory. In G. Watson (Ed.), Lecture Notes in Mathematics: Vol. 630.

Press, W.H., Teukolsky, S.A., Vetterling, W.T., & Flannery, B.P. (1992). Numerical recipes: The art of scientific computing. Cambridge, MA: Cambridge University Press.

Rosenblatt, F. (1962). Principles of neurodynamics: Perceptrons and the theory of brain mechanisms. Washington, DC: Spartan.

Sexton, R.S. (1998). Global optimization for artificial neural networks: A tabu search application. European Journal of Operational Research, 106, 570-584.

Sexton, R.S., Dorsey, R.E., & Johnson, J.D. (1999). Optimization of neural networks: A comparative analysis of the genetic algorithm and simulated annealing. European Journal of Operational Research, 114, 589-601.

Smith, S., & Lasdon, L. (1992). Solving large nonlinear programs using GRG. ORSA Journal on Computing, 4(1), 2-15.

KEY TERMS

Classification: Also known as a recognition problem; the identification of the class to which a given object belongs.

Genetic Algorithm: An iterative procedure that consists of a constant-size population of individuals, each represented by a finite string of symbols, known as the genome, encoding a possible solution in a given problem space.

Metaheuristic: A master strategy that guides and modifies other heuristics to produce solutions beyond those that are normally generated in a quest for local optimality.

Network Training: The process of finding the values of the network weights that minimize the error across a set of input/output pairs (patterns) called the training set.

Optimization: The quantitative study of optima and the methods for finding them.

Prediction: Consists of approximating unknown functions. The net's input is the values of the function variables, and the output is the estimation of the function image.

Scatter Search: A metaheuristic that belongs to the evolutionary methods.

Tabu Search: A metaheuristic procedure based on principles of intelligent search. Its premise is that problem solving, in order to qualify as intelligent, must incorporate adaptive memory and responsive exploration.
Association Rule Mining


Yew-Kwong Woon
Nanyang Technological University, Singapore

Wee-Keong Ng
Nanyang Technological University, Singapore

Ee-Peng Lim
Nanyang Technological University, Singapore

INTRODUCTION

Association Rule Mining (ARM) is concerned with how items in a transactional database are grouped together. It is commonly known as market basket analysis, because it can be likened to the analysis of items that are frequently put together in a basket by shoppers in a market. From a statistical point of view, it is a semiautomatic technique to discover correlations among a set of variables.

ARM is widely used in myriad applications, including recommender systems (Lawrence, Almasi, Kotlyar, Viveros, & Duri, 2001), promotional bundling (Wang, Zhou, & Han, 2002), Customer Relationship Management (CRM) (Elliott, Scionti, & Page, 2003), and cross-selling (Brijs, Swinnen, Vanhoof, & Wets, 1999). In addition, its concepts have also been integrated into other mining tasks, such as Web usage mining (Woon, Ng, & Lim, 2002), clustering (Yiu & Mamoulis, 2003), outlier detection (Woon, Li, Ng, & Lu, 2003), and classification (Dong & Li, 1999), for improved efficiency and effectiveness.

CRM benefits greatly from ARM, as it helps in the understanding of customer behavior (Elliott et al., 2003). Marketing managers can use association rules of products to develop joint marketing campaigns to acquire new customers. The application of ARM for the cross-selling of supermarket products has been successfully attempted in many cases (Brijs et al., 1999). In one particular study involving the personalization of supermarket product recommendations, ARM has been applied with much success (Lawrence et al., 2001). Together with customer segmentation, ARM helped to increase revenue by 1.8%.

In the biology domain, ARM is used to extract novel knowledge on protein-protein interactions (Oyama, Kitano, Satou, & Ito, 2002). It is also successfully applied in gene expression analysis to discover biologically relevant associations between different genes or between different environment conditions (Creighton & Hanash, 2003).

BACKGROUND

Recently, a new class of problems emerged to challenge ARM researchers: incoming data is streaming in too fast and changing too rapidly in an unordered and unbounded manner. This new phenomenon is termed the data stream (Babcock, Babu, Datar, Motwani, & Widom, 2002).

One major area where the data stream phenomenon is prevalent is the World Wide Web (Web). A good example is an online bookstore, where customers can purchase books from all over the world at any time. As a result, its transactional database grows at a fast rate and presents a scalability problem for ARM. Traditional ARM algorithms, such as Apriori, were not designed to handle large databases that change frequently (Agrawal & Srikant, 1994). Each time a new transaction arrives, Apriori needs to be restarted from scratch to perform ARM. Hence, it is clear that in order to conduct ARM on the latest state of the database in a timely manner, an incremental mechanism that takes into consideration the latest transactions must be in place.

In fact, a host of incremental algorithms have already been introduced to mine association rules incrementally (Sarda & Srinivas, 1998). However, they are only incremental to a certain extent; the moment the universal itemset (the set of unique items in a database) (Woon, Ng, & Das, 2001) is changed, they have to be restarted from scratch. The universal itemset of any online store would certainly change frequently, because the store needs to introduce new products and retire old ones for competitiveness. Moreover, such incremental ARM algorithms are efficient only when the database has not changed much since the last mining.

The use of data structures in ARM, particularly the trie, is one viable way to address the data stream phenomenon. Data structures first appeared when programming became increasingly complex during the 1960s. In his classic book, The Art of Computer Programming, Knuth (1968) reviewed and analyzed algorithms and data structures that are necessary for program efficiency.


Since then, the traditional data structures have been extended, and new algorithms have been introduced for them. Though computing power has increased tremendously over the years, efficient algorithms with customized data structures are still necessary to obtain timely and accurate results. This fact is especially true for ARM, which is a computationally intensive process.

The trie is a multiway tree structure that allows fast searches over string data. In addition, as strings with common prefixes share the same nodes, storage space is better utilized. This makes the trie very useful for storing large dictionaries of English words. Figure 1 shows a trie storing four English words (ape, apple, base, and ball). Several novel trielike data structures have been introduced to improve the efficiency of ARM, and we discuss them in this section.

Figure 1. An example of a trie for storing English words (diagram omitted: a tree rooted at ROOT in which ape and apple share the prefix A-P, while base and ball share the prefix B-A)

Amir, Feldman, & Kashi (1999) presented a new way of mining association rules by using a trie to preprocess the database. In this approach, all transactions are mapped onto a trie structure. This mapping involves the extraction of the powerset of the transaction items and the updating of the trie structure. Once built, there is no longer a need to scan the database to obtain support counts of itemsets, because the trie structure contains all their support counts. To find frequent itemsets, the structure is traversed by using depth-first search, and itemsets with support counts satisfying the minimum support threshold are added to the set of frequent itemsets.

Drawing upon that work, Yang, Johar, Grama, & Szpankowski (2000) introduced a binary Patricia trie to reduce the heavy memory requirements of the preprocessing trie. To support faster support queries, the authors added a set of horizontal pointers to index nodes. They also advocated the use of some form of primary threshold to further prune the structure. However, the compression achieved by the compact Patricia trie comes at a hefty price: It greatly complicates the horizontal pointer index, which is a severe overhead. In addition, after compression, it will be difficult for the Patricia trie to be updated whenever the database is altered.

The Frequent Pattern-growth (FP-growth) algorithm is a recent association rule mining algorithm that achieves impressive results (Han, Pei, Yin, & Mao, 2004). It uses a compact tree structure called a Frequent Pattern-tree (FP-tree) to store information about frequent 1-itemsets. This compact structure removes the need for multiple database scans and is constructed with only two scans. In the first database scan, frequent 1-itemsets are obtained and sorted in support-descending order. In the second scan, items in the transactions are first sorted according to the order of the frequent 1-itemsets. These sorted items are used to construct the FP-tree. Figure 2 shows an FP-tree constructed from the database in Table 1.

Table 1. A sample transactional database

TID   Items
100   A, C
200   B, C
300   A, B, C
400   A, B, C, D

Figure 2. An FP-tree constructed from the database in Table 1 at a support threshold of 50% (diagram omitted)
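To make the two-scan construction just described concrete, here is a minimal Python sketch of an FP-tree builder. This is our own illustrative code, not the authors' implementation; in particular, the header table and node links that FP-growth needs for mining are omitted for brevity.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item = item          # item label; None for the root
        self.count = 1            # number of transactions sharing this path
        self.parent = parent      # kept because FP-growth walks paths upward
        self.children = {}        # item -> FPNode

def build_fptree(transactions, min_support_count):
    # Scan 1: count item frequencies and keep only frequent 1-itemsets.
    counts = Counter(item for t in transactions for item in t)
    frequent = {i: c for i, c in counts.items() if c >= min_support_count}
    # Global order: support-descending (ties broken lexicographically).
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(order)}

    # Scan 2: insert each transaction, filtered and sorted, into the tree.
    root = FPNode(None, None)
    for t in transactions:
        path = sorted((i for i in t if i in rank), key=rank.get)
        node = root
        for item in path:
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
    return root

# The database of Table 1 at a 50% threshold (support count >= 2):
tree = build_fptree([{'A', 'C'}, {'B', 'C'}, {'A', 'B', 'C'}, {'A', 'B', 'C', 'D'}], 2)
```

Running this on Table 1 yields a root with a single child C(4), which has children A(3) and B(1), with B(2) below A; D is pruned as infrequent.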


FP-growth then proceeds to recursively mine FP-trees of decreasing size to generate frequent itemsets without candidate generation and database scans. It does so by examining all the conditional pattern bases of the FP-tree, which consist of the sets of frequent itemsets occurring with the suffix pattern. Conditional FP-trees are constructed from these conditional pattern bases, and mining is carried out recursively with such trees to discover frequent itemsets of various sizes. However, because both the construction and the use of the FP-trees are complex, the performance of FP-growth is reduced to be on par with Apriori at support thresholds of 3% and above. It only achieves significant speed-ups at support thresholds of 1.5% and below. Moreover, it is only incremental to a certain extent, depending on the FP-tree watermark (validity support threshold). As new transactions arrive, the support counts of items increase, but their relative support frequency may decrease, too. Suppose, however, that the new transactions cause too many previously infrequent itemsets to become frequent; that is, the watermark is raised too high (in order to make such itemsets infrequent) according to a user-defined level. Then the FP-tree must be reconstructed.

The use of lattice theory in ARM was pioneered by Zaki (2000). Lattice theory allows the vast search space to be decomposed into smaller segments that can be tackled independently in memory or even on other machines, thus promoting parallelism. However, lattices require additional storage space as well as different traversal and construction techniques. To complement the use of lattices, Zaki uses a vertical database format, where each itemset is associated with a list of transactions known as a tid-list (transaction identifier list). This format is useful for fast frequency counting of itemsets but generates additional overheads, because most databases have a horizontal format and would need to be converted first.

The Continuous Association Rule Mining Algorithm (CARMA), together with the support lattice, allows the user to change the support threshold and continuously displays the resulting association rules with support and confidence bounds during its first scan/phase (Hidber, 1999). During the second phase, it determines the precise support of each itemset and extracts all the frequent itemsets. CARMA can readily compute frequent itemsets for varying support thresholds. However, experiments reveal that CARMA only performs faster than Apriori at support thresholds of 0.25% and below, because of the tremendous overheads involved in constructing the support lattice.

The adjacency lattice, introduced by Aggarwal & Yu (2001), is similar to Zaki's boolean powerset lattice, except that the authors introduced the notion of adjacency among itemsets, and it does not rely on a vertical database format. Two itemsets are said to be adjacent to each other if one of them can be transformed to the other with the addition of a single item. To address the problem of heavy memory requirements, a primary threshold is defined. This term signifies the minimum support threshold possible to fit all the qualified itemsets into the adjacency lattice in main memory. However, this approach disallows the mining of frequent itemsets at support thresholds lower than the primary threshold.

MAIN THRUST

As shown in our previous discussion, none of the existing data structures can effectively address the issues induced by the data stream phenomenon. Here are the desirable characteristics of an ideal data structure that can help ARM cope with data streams:

•	It is highly scalable with respect to the size of both the database and the universal itemset.
•	It is incrementally updated as transactions are added or deleted.
•	It is constructed independent of the support threshold and thus can be used for various support thresholds.
•	It helps to speed up ARM algorithms to an extent that allows results to be obtained in real time.

We shall now discuss our novel trie data structure, which not only satisfies the above requirements but also outperforms the discussed existing structures in terms of efficiency, effectiveness, and practicality. Our structure is termed the Support-Ordered Trie Itemset (SOTrieIT, pronounced "so-try-it"). It is a dual-level support-ordered trie data structure used to store pertinent itemset information to speed up the discovery of frequent itemsets.

As its construction is carried out before actual mining, it can be viewed as a preprocessing step. For every transaction that arrives, 1-itemsets and 2-itemsets are first extracted from it. For each itemset, the SOTrieIT is traversed in order to locate the node that stores its support count. Support counts of 1-itemsets and 2-itemsets are stored in first-level and second-level nodes, respectively. The traversal of the SOTrieIT thus requires at most two redirections, which makes it very fast. At any point in time, the SOTrieIT contains the support counts of all 1-itemsets and 2-itemsets that appear in all the transactions. It will then be sorted level-wise from left to right according to the support counts of the nodes in descending order.

Figure 3 shows a SOTrieIT constructed from the database in Table 1. The bracketed number beside an item is its support count. Hence, the support count of itemset {AB} is 2. Notice that the nodes are ordered by support counts in a level-wise descending order.

Figure 3. A SOTrieIT structure (diagram: ROOT with first-level nodes C(4), A(3), B(3), D(1); at the second level, D(1) below C; C(3), B(2), D(1) below A; and C(3), D(1) below B)
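The following is a small Python sketch of the idea behind such a dual-level structure. The class and method names are ours, and plain dictionaries stand in for the support-ordered trie nodes; it is an illustration of the concept, not the published implementation.

```python
from collections import defaultdict
from itertools import combinations

class DualLevelCounts:
    """Illustrative stand-in for a SOTrieIT-style structure (names are ours)."""
    def __init__(self):
        self.level1 = defaultdict(int)                        # item -> count
        self.level2 = defaultdict(lambda: defaultdict(int))   # item -> item -> count

    def insert(self, transaction):
        # Updated online, one arriving transaction at a time; no support
        # threshold is needed at construction time (support independence).
        items = sorted(transaction)
        for item in items:
            self.level1[item] += 1
        for a, b in combinations(items, 2):
            self.level2[a][b] += 1

    def frequent_1_and_2_itemsets(self, min_count):
        # Emit in support-descending order; the real structure keeps nodes
        # physically ordered so a depth-first scan can stop early.
        f1 = sorted(((i, c) for i, c in self.level1.items() if c >= min_count),
                    key=lambda x: -x[1])
        f2 = sorted((((a, b), c) for a, kids in self.level2.items()
                     for b, c in kids.items() if c >= min_count),
                    key=lambda x: -x[1])
        return f1, f2

trie = DualLevelCounts()
for t in [{'A', 'C'}, {'B', 'C'}, {'A', 'B', 'C'}, {'A', 'B', 'C', 'D'}]:
    trie.insert(t)
print(trie.frequent_1_and_2_itemsets(2))
# ([('C', 4), ('A', 3), ('B', 3)], [(('A', 'C'), 3), (('B', 'C'), 3), (('A', 'B'), 2)])
```

Because construction never consults a support threshold, adding or removing an item of the universal itemset only touches the entries that mention it.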


In algorithms such as FP-growth that use a similar data structure to store itemset information, the structure must be rebuilt to accommodate updates to the universal itemset. The SOTrieIT, in contrast, can be easily updated to accommodate the new changes. If a node for a new item in the universal itemset does not exist, it will be created and inserted into the SOTrieIT accordingly. If an item is removed from the universal itemset, all nodes containing that item need only be removed, and the rest of the nodes would still be valid.

Unlike the trie structure of Amir et al. (1999), the SOTrieIT is ordered by support count (which speeds up mining) and does not require the powersets of transactions (which reduces construction time). The main weakness of the SOTrieIT is that it can only discover frequent 1-itemsets and 2-itemsets; its main strength is its speed in discovering them. They can be found promptly because there is no need to scan the database. In addition, the search (depth first) can be stopped at a particular level the moment a node representing a nonfrequent itemset is found, because the nodes are all support ordered.

Another advantage of the SOTrieIT, compared with all previously discussed structures, is that it can be constructed online, meaning that each time a new transaction arrives, the SOTrieIT can be incrementally updated. This feature is possible because the SOTrieIT is constructed without the need to know the support threshold; it is support independent. All 1-itemsets and 2-itemsets in the database are used to update the SOTrieIT regardless of their support counts. To conserve storage space, existing trie structures such as the FP-tree have to use thresholds to keep their sizes manageable; thus, when new transactions arrive, they have to be reconstructed, because the support counts of itemsets will have changed.

Finally, the SOTrieIT requires far less storage space than a trie or Patricia trie because it is only two levels deep and can be easily stored in both memory and files. Although this causes some input/output (I/O) overheads, they are insignificant, as shown in our extensive experiments. We have designed several algorithms to work synergistically with the SOTrieIT and, through experiments with existing prominent algorithms and a variety of databases, we have proven the practicality and superiority of our approach (Das, Ng, & Woon, 2001; Woon et al., 2001). In fact, our latest algorithm, FOLD-growth, is shown to outperform FP-growth by more than 100 times (Woon, Ng, & Lim, 2004).

FUTURE TRENDS

The data stream phenomenon will eventually become ubiquitous as Internet access and bandwidth become increasingly affordable. With keen competition, products will become more complex with customization and more varied to cater to a broad customer base; transaction databases will grow in both size and complexity. Hence, association rule mining research will certainly continue to receive much attention in the quest for faster, more scalable, and more configurable algorithms.

CONCLUSION

Association rule mining is an important data mining task with several applications. However, to cope with the current explosion of raw data, data structures must be utilized to enhance its efficiency. We have analyzed several existing trie data structures used in association rule mining and presented our novel trie structure, which has been proven to be most useful and practical. What lies ahead is the parallelization of our structure to further accommodate the ever-increasing demands of today's need for speed and scalability to obtain association rules in a timely manner. Another challenge is to design new data structures that facilitate the discovery of trends as association rules evolve over time. Different association rules may be mined at different time points and, by understanding the patterns of changing rules, additional interesting knowledge may be discovered.

REFERENCES

Aggarwal, C. C., & Yu, P. S. (2001). A new approach to online generation of association rules. IEEE Transactions on Knowledge and Data Engineering, 13(4), 527-540.

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. Proceedings of the 20th International Conference on Very Large Databases (pp. 487-499), Chile.

Amir, A., Feldman, R., & Kashi, R. (1999). A new and versatile method for association generation. Information Systems, 22(6), 333-347.

Babcock, B., Babu, S., Datar, M., Motwani, R., & Widom, J. (2002). Models and issues in data stream systems. Proceedings of the ACM SIGMOD/PODS Conference (pp. 1-16), USA.

Brijs, T., Swinnen, G., Vanhoof, K., & Wets, G. (1999). Using association rules for product assortment decisions: A case study. Proceedings of the Fifth ACM SIGKDD Conference (pp. 254-260), USA.

Creighton, C., & Hanash, S. (2003). Mining gene expression databases for association rules. Bioinformatics, 19(1), 79-86.


Das, A., Ng, W. K., & Woon, Y. K. (2001). Rapid association rule mining. Proceedings of the 10th International Conference on Information and Knowledge Management (pp. 474-481), USA.

Dong, G., & Li, J. (1999). Efficient mining of emerging patterns: Discovering trends and differences. Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining (pp. 43-52), USA.

Elliott, K., Scionti, R., & Page, M. (2003). The confluence of data mining and market research for smarter CRM. Retrieved from http://www.spss.com/home_page/wp133.htm

Han, J., Pei, J., Yin, Y., & Mao, R. (2004). Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Mining and Knowledge Discovery, 8(1), 53-97.

Hidber, C. (1999). Online association rule mining. Proceedings of the ACM SIGMOD Conference (pp. 145-154), USA.

Knuth, D. E. (1968). The art of computer programming: Vol. 1. Fundamental algorithms. Addison-Wesley.

Lawrence, R. D., Almasi, G. S., Kotlyar, V., Viveros, M. S., & Duri, S. (2001). Personalization of supermarket product recommendations. Data Mining and Knowledge Discovery, 5(1/2), 11-32.

Oyama, T., Kitano, K., Satou, K., & Ito, T. (2002). Extraction of knowledge on protein-protein interaction by association rule discovery. Bioinformatics, 18(5), 705-714.

Sarda, N. L., & Srinivas, N. V. (1998). An adaptive algorithm for incremental mining of association rules. Proceedings of the Ninth International Conference on Database and Expert Systems (pp. 240-245), Austria.

Wang, K., Zhou, S., & Han, J. (2002). Profit mining: From patterns to actions. Proceedings of the Eighth International Conference on Extending Database Technology (pp. 70-87), Prague.

Woon, Y. K., Li, X., Ng, W. K., & Lu, W. F. (2003). Parameterless data compression and noise filtering using association rule mining. Proceedings of the Fifth International Conference on Data Warehousing and Knowledge Discovery (pp. 278-287), Prague.

Woon, Y. K., Ng, W. K., & Das, A. (2001). Fast online dynamic association rule mining. Proceedings of the Second International Conference on Web Information Systems Engineering (pp. 278-287), Japan.

Woon, Y. K., Ng, W. K., & Lim, E. P. (2002). Online and incremental mining of separately grouped web access logs. Proceedings of the Third International Conference on Web Information Systems Engineering (pp. 53-62), Singapore.

Woon, Y. K., Ng, W. K., & Lim, E. P. (2004). A support-ordered trie for fast frequent itemset discovery. IEEE Transactions on Knowledge and Data Engineering, 16(5).

Yang, D. Y., Johar, A., Grama, A., & Szpankowski, W. (2000). Summary structures for frequency queries on large transaction sets. Proceedings of the Data Compression Conference (pp. 420-429).

Yiu, M. L., & Mamoulis, N. (2003). Frequent-pattern based iterative projected clustering. Proceedings of the Third International Conference on Data Mining, USA.

Zaki, M. J. (2000). Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12(3), 372-390.

KEY TERMS

Apriori: A classic algorithm that popularized association rule mining. It pioneered a method to generate candidate itemsets by using only frequent itemsets in the previous pass. The idea rests on the fact that any subset of a frequent itemset must be frequent as well. This idea is also known as the downward closure property.

Itemset: An unordered set of unique items, which may be products or features. For computational efficiency, the items are often represented by integers. A frequent itemset is one with a support count that exceeds the support threshold, and a candidate itemset is a potential frequent itemset. A k-itemset is an itemset with exactly k items.

Key: A unique sequence of values that defines the location of a node in a tree data structure.

Patricia Trie: A compressed binary trie. The Patricia (Practical Algorithm to Retrieve Information Coded in Alphanumeric) trie is compressed by avoiding one-way branches. This is accomplished by including in each node the number of bits to skip over before making the next branching decision.

SOTrieIT: A dual-level trie whose nodes represent itemsets. The position of a node is ordered by the support count of the itemset it represents; the most frequent itemsets are found on the leftmost branches of the SOTrieIT.


Support Count of an Itemset: The number of transactions that contain a particular itemset.

Support Threshold: A threshold value that is used to decide if an itemset is interesting/frequent. It is defined by the user and, generally, an association rule mining algorithm has to be executed many times before this value can be well adjusted to yield the desired results.

Trie: An n-ary tree whose organization is based on key space decomposition. In key space decomposition, the key range is equally subdivided, and the splitting position within the key range for each node is predefined.


Association Rule Mining and Application to MPIS
Raymond Chi-Wing Wong
The Chinese University of Hong Kong, Hong Kong

Ada Wai-Chee Fu
The Chinese University of Hong Kong, Hong Kong

INTRODUCTION

Association rule mining (Agrawal, Imielinski, & Swami, 1993) has been proposed for understanding the relationships among items in transactions or market baskets. For instance, if a customer buys butter, what is the chance that he/she buys bread at the same time? Such information may be useful for decision makers to determine strategies in a store.

BACKGROUND

Consider a set I = {I1, I2, ..., In} of items (e.g., carrot, orange, and knife) in a supermarket. The database contains a number of transactions. Each transaction t is a binary vector with t[k] = 1 if t bought item Ik and t[k] = 0 otherwise (e.g., {1, 0, 0, 1, 0}). An association rule is of the form X → Ij, where X is a set of some items in I, and Ij is a single item not in X (e.g., {Orange, Knife} → Plate).

A transaction t satisfies X if, for all items Ik in X, t[k] = 1. The support for a rule X → Ij is the fraction of transactions that satisfy the union of X and Ij. A rule X → Ij has confidence c% if and only if c% of the transactions that satisfy X also satisfy Ij.

The mining process of association rules can be divided into two steps:

1.	Frequent Itemset Generation: Generate all sets of items that have support greater than or equal to a certain threshold, called minsupport.
2.	Association Rule Generation: From the frequent itemsets, generate all association rules that have confidence greater than or equal to a certain threshold, called minconfidence.

Step 1 is much more difficult than Step 2. Thus, researchers have focused on the studies of frequent itemset generation.

The Apriori Algorithm, proposed by Agrawal & Srikant (1994), is a well-known approach to find frequent itemsets. It is an iterative approach, and there are two steps in each iteration. The first step generates a set of candidate itemsets. Then, the second step prunes all disqualified candidates (i.e., all infrequent itemsets). The iterations begin with size-2 itemsets, and the size is incremented at each iteration. The algorithm is based on the closure property of frequent itemsets: if a set of items is frequent, then all its proper subsets are also frequent. The weaknesses of this algorithm are the generation of a large number of candidate itemsets and the requirement to scan the database once in each iteration.

A data structure called the FP-tree and an efficient algorithm called FP-growth were proposed by Han, Pei, & Yin (2000) to overcome the above weaknesses. The idea of the FP-tree is to fetch all transactions from the database and insert them into a compressed tree structure. Then, the FP-growth algorithm reads from the FP-tree structure to mine frequent itemsets.


MAIN THRUST

Variations in Association Rules

Many variations on the above problem formulation have been suggested. The association rules can be classified based on the following (Han & Kamber, 2000):

1.	Association Rules Based on the Type of Values of Attributes: Based on the type of values of attributes, there are two kinds: the Boolean association rule, which is presented above, and the quantitative association rule. A quantitative association rule describes the relationships among some quantitative attributes (e.g., income and age). An example is income(40K..50K) → age(40..45). One proposed method is grid-based: dividing each attribute into a fixed number of partitions [the Association Rule Clustering System (ARCS) in Lent, Swami & Widom (1997)]. Srikant & Agrawal (1996) proposed to partition quantitative attributes dynamically and to merge the partitions based on a measure of partial completeness. Another non-grid-based approach is found in Zhang, Padmanabhan, & Tuzhilin (2004).
2.	Association Rules Based on the Dimensionality of Data: Association rules can be divided into single-dimensional association rules and multi-dimensional association rules. One example of a single-dimensional rule is buys({Orange, Knife}) → buys(Plate), which contains only the dimension buys. A multi-dimensional association rule is one containing attributes for more than one dimension, for example, income(40K..50K) → buys(Plate). One mining approach is to borrow the concept of the data cube from the field of data warehousing. Figure 1 shows a lattice for the data cube for the dimensions age, income, and buys. Researchers (Kamber, Han, & Chiang, 1997) applied the data cube model and used aggregate techniques for mining.
3.	Association Rules Based on the Level of Abstraction of Attributes: The rules discussed in previous sections can be viewed as single-level association rules. A rule that references different levels of abstraction of attributes is called a multilevel association rule. Suppose there are two rules, income(10K..20K) → buys(fruit) and income(10K..20K) → buys(orange). There are two different levels of abstraction in these two rules, because fruit is a higher-level abstraction of orange. Han & Fu (1995) apply a top-down strategy to the concept hierarchy in the mining of frequent itemsets.

Figure 1. A lattice showing the data cube for the dimensions age, income, and buys (diagram: nodes (), (age), (income), (buys), (age, income), (age, buys), (income, buys), and (age, income, buys))

Figure 2. A concept hierarchy of the fruit (diagram: fruit with children apple, orange, and banana)

Other Extensions to Association Rule Mining

There are other extensions to association rule mining. Some of them (Bayardo, 1998) find maxpatterns (i.e., maximal frequent patterns), while others (Zaki & Hsiao, 2002) find frequent closed itemsets. A maxpattern is a frequent itemset that does not have a frequent superset. A frequent itemset X is a frequent closed itemset if there exists no itemset X' such that (1) X is a proper subset of X', and (2) for all transactions t, if X is in t, then X' is in t. These considerations can reduce the resulting number of frequent itemsets significantly (a small sketch of the closed-itemset condition follows at the end of this subsection).

Another variation of the frequent itemset problem is mining top-K frequent itemsets (Cheung & Fu, 2004). The problem is to find the K frequent itemsets with the greatest supports. It is often more reasonable to assume the parameter K instead of the data-distribution-dependent parameter minsupport, because the user typically would not have knowledge of the data distribution before data mining.

Other variations of the problem are the incremental update of mining association rules (Hidber, 1999), constraint-based rule mining (Grahne & Lakshmanan, 2000), distributed and parallel association rule mining (Gilburd, Schuster, & Wolff, 2004), association rule mining with multiple minimum supports/without minimum support (Chiu, Wu, & Chen, 2004), association rule mining with weighted items and weighted support (Tao, Murtagh, & Farid, 2003), and fuzzy association rule mining (Kuok, Fu, & Wong, 1998).

Association rule mining has also been integrated with other data mining problems. There have been the integration of classification and association rule mining (Wang, Zhou, & He, 2000) and the integration of association rule mining with relational database systems (Sarawagi, Thomas, & Agrawal, 1998).
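The closed-itemset condition above has a compact operational form: an itemset is closed exactly when it equals the intersection of all transactions that contain it. The following short Python sketch (our own illustration, with made-up data) checks it directly:

```python
def is_closed(itemset, transactions):
    # Transactions containing the itemset; their intersection is its closure.
    supporting = [t for t in transactions if itemset <= t]
    closure = frozenset.intersection(*supporting) if supporting else itemset
    return itemset == closure

db = [frozenset(t) for t in ({'A', 'C'}, {'B', 'C'}, {'A', 'B', 'C'}, {'A', 'B', 'C', 'D'})]
print(is_closed(frozenset({'A', 'C'}), db))  # True
print(is_closed(frozenset({'A', 'B'}), db))  # False: every transaction with A and B also has C
```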
Application of the Concept of Association Rules to MPIS

Other than market basket analysis (Blischok, 1995), association rules can also help in applications such as intrusion detection (Lee, Stolfo, & Mok, 1999), heterogeneous genome data (Satou et al., 1997), mining remotely sensed images/data (Dong, Perrizo, Ding, & Zhou, 2000), and product assortment decisions (Wong, Fu, & Wang, 2003; Wong & Fu, 2004). Here we focus on the application to product assortment decisions, as it is one of very few examples where the association rules are not the end mining results.

The transaction database in some applications can be very large. For example, Hedberg (1995) quoted that Wal-Mart kept about 20 million sales transactions per day. Such data requires sophisticated analysis. As pointed out by Blischok (1995), a major task of talented merchants is to pick the profit-generating items and discard the losing items. It may be simple enough to sort items by their profit and do the selection. However, this ignores a very important aspect in market analysis: the cross-selling effect. There can be items that do not generate much profit by themselves, but they are the catalysts for the sales of other profitable items. Recently, some researchers (Kleinberg, Papadimitriou, & Raghavan, 1998) suggested that concepts of association rules can be used in the item selection problem with the consideration of relationships among items.

One example of the product assortment decisions is Maximal-Profit Item Selection (MPIS) with cross-selling considerations (Wong, Fu, & Wang, 2003). Consider the major task of merchants: to pick profit-generating items and discard the losing items. Assume we have a history record of the sales (transactions) of all items. The problem is to select a subset from the given set of items so that the estimated profit of the resulting selection is maximal among all choices.

Suppose a shop carries office equipment composed of monitors, keyboards, and telephones, with profits of $1000K, $100K, and $300K, respectively. If now the shop decides to remove one of the three items from its stock, the question is which two we should choose to keep. If we simply examine the profits, we may choose to keep monitors and telephones, and so the total profit is $1300K. However, we know that there is a strong cross-selling effect between monitor and keyboard (see Table 1). If the shop stops carrying keyboards, the customers of monitors may choose to shop elsewhere to get both items. The profit from monitors may drop greatly, and we may be left with a profit of $300K from telephones only. If we choose to keep both monitors and keyboards, then the profit can be expected to be $1100K, which is higher.

Table 1. (rows are transactions; 1 indicates that the item was bought)

Monitor   Keyboard   Telephone
   1          1          0
   1          1          0
   0          0          1
   0          0          1
   0          0          1
   1          1          1

MPIS will give us the desired solution. MPIS utilizes the concept of the relationship between selected items and unselected items. Such a relationship is modeled by the cross-selling factor. Suppose d is the set of unselected items and I is a selected item. A loss rule is proposed in the form I → ◊d, where ◊d means the purchase of any item in d. The rule indicates that, from the history, whenever a customer buys the item I, he/she also buys at least one of the items in d. Interpreting this as a pattern of customer behavior, and assuming that the pattern will not change even when some items are removed from the stock, if none of the items in d are available, then the customer also will not purchase I. This is because, if the customer still purchased I without purchasing any items in d, the pattern would be changed. Therefore, the higher the confidence of I → ◊d, the more likely the profit of I should not be counted. This is the reasoning behind the above definition. In the above example, suppose we choose monitor and telephone. Then, d = {keyboard}. All profits of monitor will be lost if, in the history, we find conf(I → ◊d) = 1, where I = monitor (see the sketch following this section). This example illustrates the importance of the consideration of the cross-selling factor in the profit estimation.

Wong, Fu, & Wang (2003) propose two algorithms to deal with this problem. In the first algorithm, they approximate the total profit of the item selection in quadratic form and solve a quadratic optimization problem. The second one is a greedy approach called MPIS_Alg, which prunes items iteratively, according to an estimated function based on the formula of the total profit of the item selection, until J items remain.

Another product assortment decision problem is studied by Wong & Fu (2004), which addresses the problem of selecting a set of marketing items in order to boost the sales of the store.
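The loss-rule confidence can be computed directly from the transaction history. The following minimal Python sketch (function and variable names are ours) reproduces the conf(I → ◊d) = 1 observation for Table 1:

```python
def loss_rule_confidence(item, d, transactions):
    """conf(I -> any item of d): among transactions containing I, the
    fraction that also contain at least one item of the unselected set d."""
    has_item = [t for t in transactions if item in t]
    if not has_item:
        return 0.0
    return sum(bool(t & d) for t in has_item) / len(has_item)

# Table 1 as transactions over {monitor, keyboard, telephone}:
db = [{'monitor', 'keyboard'}, {'monitor', 'keyboard'}, {'telephone'},
      {'telephone'}, {'telephone'}, {'monitor', 'keyboard', 'telephone'}]

# Keep {monitor, telephone} and drop keyboard (so d = {keyboard}):
print(loss_rule_confidence('monitor', {'keyboard'}, db))  # 1.0: monitor profit at risk
```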
FUTURE TRENDS

A new area for investigation of the problem of mining frequent itemsets is the mining of data streams for frequent itemsets (Manku & Motwani, 2002; Yu, Chong, Lu, & Zhou, 2004). In such a problem, the data is so massive that it cannot all be stored in the memory of a computer, nor can it be processed by traditional algorithms. The objective of the proposed algorithms is to store as little data as possible and to minimize the error generated by the estimation in the model.

Privacy preservation in association rule mining has also been rigorously studied in recent years (Vaidya & Clifton, 2002; Agrawal, Evfimievski, & Srikant, 2003). The problem is to mine from two or more different sources without exposing individual transaction data to each other.

CONCLUSION

Association rule mining plays an important role in the literature of data mining. It poses many challenging issues for the development of efficient and effective methods. After taking a closer look, we find that the application of association rules requires much more investigation in order to aid more specific targets. We may see a trend towards the study of applications of association rules.

REFERENCES

Agrawal, R., Evfimievski, A., & Srikant, R. (2003). Information sharing across private databases. SIGMOD, 86-97.

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. SIGMOD, 129-140.

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. In Proceedings of the 20th VLDB Conference (pp. 487-499).

Bayardo, R. J. (1998). Efficiently mining long patterns from databases. SIGMOD, 85-93.

Blischok, T. (1995). Every transaction tells a story. Chain Store Age Executive with Shopping Center Age, 71(3), 50-57.

Cheung, Y. L., & Fu, A. W.-C. (2004). Mining association rules without support threshold: With and without item constraints. TKDE, 16(9), 1052-1069.

Chiu, D.-Y., Wu, Y.-H., & Chen, A. L. P. (2004). An efficient algorithm for mining frequent sequences by a new strategy without support counting. ICDE, 375-386.

Dong, J., Perrizo, W., Ding, Q., & Zhou, J. (2000). The application of association rule mining to remotely sensed data. In Proceedings of the 2000 ACM Symposium on Applied Computing (pp. 340-345).

Gilburd, B., Schuster, A., & Wolff, R. (2004). A new privacy model and association-rule mining algorithm for large-scale distributed environments. SIGKDD.

Grahne, G., Lakshmanan, L., & Wang, X. (2000). Efficient mining of constrained correlated sets. ICDE, 512-521.

Han, J., & Fu, Y. (1995). Discovery of multiple-level association rules from large databases. In Proceedings of the 1995 International Conference on VLDB (pp. 420-431).

Han, J., & Kamber, M. (2000). Data mining: Concepts and techniques. San Mateo, CA: Morgan Kaufmann.

Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. SIGMOD, 1-12.

Hedberg, S. (1995, October). The data gold rush. BYTE, 83-99.

Hidber, C. (1999). Online association rule mining. SIGMOD, 145-156.

Kamber, M., Han, J., & Chiang, J. Y. (1997). Metarule-guided mining of multi-dimensional association rules using data cubes. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (pp. 207-210).

Kleinberg, J., Papadimitriou, C., & Raghavan, P. (1998). A microeconomic view of data mining. Knowledge Discovery Journal, 2(4), 311-324.

Kuok, C. M., Fu, A. W.-C., & Wong, M. H. (1998). Mining fuzzy association rules in databases. ACM SIGMOD Record, 27(1), 41-46.

Lee, W., Stolfo, S. J., & Mok, K. W. (1999). A data mining framework for building intrusion detection models. In IEEE Symposium on Security and Privacy (pp. 120-132).

Lent, B., Swami, A. N., & Widom, J. (1997). Clustering association rules. In ICDE (pp. 220-231).

Manku, G. S., & Motwani, R. (2002). Approximate frequency counts over data streams. In Proceedings of the 20th International Conference on VLDB (pp. 346-357).

Sarawagi, S., Thomas, S., & Agrawal, R. (1998). Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD, 343-354.

Satou, K., Shibayama, G., Ono, T., Yamamura, Y., Furuichi, E., Kuhara, S., & Takagi, T. (1997). Finding association rules on heterogeneous genome data. In Pacific Symposium on Biocomputing (PSB) (pp. 397-408).

Srikant, R., & Agrawal, R. (1996). Mining quantitative association rules in large relational tables. SIGMOD, 1-12.

Tao, F., Murtagh, F., & Farid, M. (2003). Weighted association rule mining using weighted support and significance framework. In The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 661-666).

Vaidya, J., & Clifton, C. (2002). Privacy preserving association rule mining in vertically partitioned data. In The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 639-644).

Wang, K., Zhou, S., & He, Y. (2000). Growing decision trees on support-less association rules. In Sixth ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 265-269).

Wong, R. C.-W., & Fu, A. W.-C. (2004). ISM: Item selection for marketing with cross-selling considerations. In Advances in Knowledge Discovery and Data Mining, 8th Pacific-Asia Conference (PAKDD) (pp. 431-440), Lecture Notes in Computer Science 3056. Berlin: Springer.

Wong, R. C.-W., Fu, A. W.-C., & Wang, K. (2003). MPIS: Maximal-profit item selection with cross-selling considerations. In IEEE International Conference on Data Mining (ICDM) (pp. 371-378).

Yu, J. X., Chong, Z., Lu, H., & Zhou, A. (2004). False positive or false negative: Mining frequent itemsets from high speed transactional data streams. In Proceedings of the Thirtieth International Conference on Very Large Data Bases.

Zaki, M. J., & Hsiao, C. J. (2002). CHARM: An efficient algorithm for closed itemset mining. In SIAM International Conference on Data Mining (SDM).

Zhang, H., Padmanabhan, B., & Tuzhilin, A. (2004). On the discovery of significant statistical quantitative rules. In Proceedings of the 10th ACM SIGKDD Knowledge Discovery and Data Mining Conference.

KEY TERMS

Association Rule: A rule of the form X → Ij, where X is a set of some items and Ij is a single item not in X.

Confidence: The confidence of a rule X → Ij, where X is a set of items and Ij is a single item not in X, is the fraction of the transactions containing all items in set X that also contain item Ij.

Frequent Itemset/Pattern: An itemset with support greater than or equal to a certain threshold, called minsupport.

Infrequent Itemset: An itemset with support smaller than a certain threshold, called minsupport.

Itemset: A set of items.

K-Itemset: An itemset with k items.

Maximal-Profit Item Selection (MPIS): The problem of item selection that selects a set of items in order to maximize the total profit with the consideration of the cross-selling effect.

Support (Itemset) or Frequency: The support of an itemset X is the fraction of transactions containing all items in X.

Support (Rule): The support of a rule X → Ij, where X is a set of items and Ij is a single item not in X, is the fraction of all transactions that contain every item in X as well as the item Ij.

Transaction: A record containing the items bought by a customer.

Association Rule Mining of Relational Data


Anne Denton
North Dakota State University, USA

Christopher Besemann
North Dakota State University, USA

INTRODUCTION

Most data of practical relevance are structured in more complex ways than is assumed in traditional data mining algorithms, which are based on a single table. The concept of relations allows for discussing many data structures such as trees and graphs. Relational data have much generality and are of significant importance, as demonstrated by the ubiquity of relational database management systems. It is, therefore, not surprising that popular data mining techniques, such as association rule mining, have been generalized to relational data. An important aspect of the generalization process is the identification of problems that are new to the generalized setting.

BACKGROUND

Several areas of databases and data mining contribute to advances in association rule mining of relational data:

•	Relational Data Model: Underlies most commercial database technology and also provides a strong mathematical framework for the manipulation of complex data. Relational algebra provides a natural starting point for generalizations of data mining techniques to complex data types.
•	Inductive Logic Programming, ILP (Džeroski & Lavrač, 2001): A form of logic programming, in which individual instances are generalized to make hypotheses about unseen data. Background knowledge is incorporated directly.
•	Association Rule Mining, ARM (Agrawal, Imielinski, & Swami, 1993): Identifies associations and correlations in large databases. Association rules are defined based on items, such as objects in a shopping cart. Efficient algorithms are designed by limiting output to sets of items that occur more frequently than a given threshold.
•	Graph Theory: Addresses networks that consist of nodes, which are connected by edges. Traditional graph theoretic problems typically assume no more than one property per node or edge. Data associated with nodes and edges can be modeled within the relational algebra.

Association rule mining of relational data incorporates important aspects of these areas to form an innovative data mining technique of important practical relevance.

MAIN THRUST

The general concept of association rule mining of relational data will be explored, as well as the special case of mining a relationship that corresponds to a graph.

General Concept

Two main challenges have to be addressed when applying association rule mining to relational data. Combined mining of multiple tables leads to a search space that is typically large even for moderately sized tables. Performance is, thereby, commonly an important issue in relational data mining algorithms. A less obvious problem lies in the skewing of results (Jensen & Neville, 2002). The relational join operation combines each record from one table with each occurrence of the corresponding record in a second table. That means that the information in one record is represented multiple times in the joined table. Data mining algorithms that operate either explicitly or implicitly on joined tables thereby use the same information multiple times, as the sketch below illustrates. Note that this problem also applies to algorithms in which tables are joined on the fly by identifying corresponding records as they are needed. Further specific issues may have to be addressed when reflexive relationships are present. These issues will be discussed in the section on relations that represent a graph.

A variety of techniques have been developed for data mining of relational data (Džeroski & Lavrač, 2001). A typical approach is called inductive logic programming, ILP. In this approach, relational structure is represented in the form of Prolog queries, leaving maximum flexibility to the user. While the notation of ILP differs from the relational notation, it can be noted that all relational operators can also be represented in ILP. The approach does thereby not limit the types of problems that can be addressed. It should, however, also be noted that, while relational database management systems are developed with performance in mind, there may be a trade-off between the generality of Prolog-based environments and their limitations in speed.

Application of ARM within the ILP setting corresponds to a search for frequent Prolog queries as a generalization of traditional association rules (Dehaspe & De Raedt, 1997). An example of association rule mining of relational data using ILP (Dehaspe & Toivonen, 2001) could be the shopping behavior of customers, where relationships between customers are included in the reasoning. While ILP does not use a relational joining step as such, it does also associate individual objects with multiple occurrences of corresponding objects. Problems with skewing are, thereby, also encountered in this approach.

An alternative to the ILP approach is to apply the standard definition of association rule mining to relations that are joined using the relational join operation. While such an approach is less general, it is often more efficient, since the join operation is highly optimized in standard database systems. It is important to note that a join operation typically changes the support of an item set, and any support calculation should therefore be based on the relation that uses the smallest number of join operations (Cristofor & Simovici, 2001). Equivalent changes in item set weighting occur in ILP.

Interestingness of rules is an important issue in any type of association rule mining. In traditional association rule mining, the problem of rule interest has been addressed in a variety of work on redundant rules, including closed set generation (Zaki, 2000). Additional rule metrics, such as lift and conviction, have been defined (Brin, Motwani, Ullman, & Tsur, 1997). In relational association rule mining, the problem has been approached by the definition of a deviation measure (Dehaspe & Toivonen, 2001). In general, it can be noted that relational data mining poses many additional problems related to skewing of data compared with traditional mining on a single table (Jensen & Neville, 2002).

Relations that Represent a Graph

One type of relational data set has traditionally received particular attention, albeit under a different name. A relation representing a relationship between entity instances of the same type, also called a reflexive relationship, can be viewed as the definition of a graph. Graphs have been used to represent social networks, biological networks, communication networks, and citation graphs, just to name a few.

A typical example of an association rule mining problem is mining of annotation data of proteins in the presence of a protein-protein interaction graph (Oyama, Kitano, Satou, & Ito, 2002). Associations are extracted that relate functions and localizations of one protein with those of interacting proteins. Oyama et al. use association rule mining, as applied to joined relations, for this work. Another example could be association rule mining of attributes associated with scientific publications on the graph of their mutual citations.

A problem of the straightforward approach of mining joined tables directly becomes obvious upon further study of the rules: In most cases, the output is dominated by rules that involve the same item as it occurs in different entity instances that participate in a relationship. In the example of protein annotations within the protein interaction graph, a protein in the nucleus is found to frequently interact with another protein that is also located in the nucleus. Similarities among relational neighbors have been observed more generally for relational databases (Macskassy & Provost, 2003). It can be shown that filtering of output is not a consistent solution to this problem, and items that are repeated for multiple nodes should be eliminated in a preprocessing step (Besemann & Denton, 2004); a small sketch of such a step appears at the end of this section. This is an example of a problem that does not occur in association rule mining of a single table and requires special attention when moving to multiple relations. The example also highlights the need to discuss differences between the sets of items of related objects (Besemann, Denton, Yekkirala, Hutchison, & Anderson, 2004).

Related Research Areas

A related research area is graph-based ARM (Inokuchi, Washio, & Motoda, 2000; Yan & Han, 2002). Graph-based ARM does not typically consider more than one label on each node or edge. The goal of graph-based ARM is to find frequent substructures based on that one label, focusing on algorithms that scale to large subgraphs. In relational ARM, multiple items are associated with each node, and the main problem is to achieve scaling with respect to the number of items per node. Scaling to large subgraphs is usually irrelevant due to the small-world property of many types of graphs. For most networks of practical interest, any node can be reached from almost any other by means of no more than some small number of edges (Barabasi & Bonabeau, 2003). Association rules that involve longer distances are therefore unlikely to produce meaningful results.

There are other areas of research on ARM in which related transactions are mined in some combined fashion. Sequential pattern or episode mining (Agrawal & Srikant, 1995; Yan, Han, & Afshar, 2003) and inter-transaction mining (Tung, Lu, Han, & Feng, 1999) are two main categories. Generally, the interest in association rule mining is moving beyond the single-table setting to incorporate the complex requirements of real-world data.
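As a small illustration of the preprocessing step mentioned above, the following Python sketch (our own simplified reading of the idea; names and data are hypothetical) removes items that occur on both endpoints of a relationship instance before mining, so that trivial rules such as "nucleus implies nucleus" cannot dominate the output:

```python
def deduplicate_pair(items_a, items_b):
    """Drop items shared by both endpoints of one relationship instance."""
    shared = items_a & items_b
    return items_a - shared, items_b - shared

# One protein-protein interaction with annotation sets (made-up values):
p1 = {'nucleus', 'transcription'}
p2 = {'nucleus', 'repair'}
print(deduplicate_pair(p1, p2))  # ({'transcription'}, {'repair'})
```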
FUTURE TRENDS

The consensus in the data mining community on the importance of relational data mining was recently paraphrased by Dietterich (2003) as "I.i.d. learning is dead. Long live relational learning." The statistics, machine learning, and ultimately data mining communities have invested decades into sound theories based on a single table. It is now time to afford as much rigor to relational data. When taking this step, it is important not only to specify generalizations of existing algorithms but also to identify novel questions that may be asked that are specific to the relational setting. It is, furthermore, important to identify challenges that occur only in the relational setting, including skewing due to the application of the relational join operator and correlations that are frequent among relational neighbors.

CONCLUSION

Association rule mining of relational data is a powerful frequent pattern mining technique that is useful for several data structures, including graphs. Two main approaches are distinguished. Inductive logic programming provides a high degree of flexibility, while mining of joined relations is a fast technique that allows the study of problems related to skewed or uninteresting results. The potential computational complexity of relational algorithms and specific properties of relational data make their mining an important current research topic. Association rule mining takes a special role in this process, being one of the most important frequent pattern algorithms.

REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. N. (1993, May). Mining association rules between sets of items in large databases. In Proceedings of the ACM International Conference on Management of Data (pp. 207-216), Washington, D.C.

Agrawal, R., & Srikant, R. (1995). Mining sequential patterns. In Proceedings of the 11th International Conference on Data Engineering (pp. 3-14), IEEE Computer Society Press, Taipei, Taiwan.

Barabasi, A. L., & Bonabeau, E. (2003). Scale-free networks. Scientific American, 288(5), 60-69.

Besemann, C., & Denton, A. (2004, June). UNIC: UNique item counts for association rule mining in relational data. Technical Report, North Dakota State University, Fargo, North Dakota.

Besemann, C., Denton, A., Yekkirala, A., Hutchison, R., & Anderson, M. (2004, August). Differential association rule mining for the study of protein-protein interaction networks. In Proceedings of the ACM SIGKDD Workshop on Data Mining in Bioinformatics, Seattle, WA.

Brin, S., Motwani, R., Ullman, J. D., & Tsur, S. (1997). Dynamic itemset counting and implication rules for market basket data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Tucson, AZ.

Cristofor, L., & Simovici, D. (2001). Mining association rules in entity-relationship modeled databases. Technical Report, University of Massachusetts Boston.

Dehaspe, L., & De Raedt, L. (1997, December). Mining association rules in multiple relations. In Proceedings of the 7th International Workshop on Inductive Logic Programming (pp. 125-132), Prague, Czech Republic.

Dehaspe, L., & Toivonen, H. (2001). Discovery of relational association rules. In S. Džeroski & N. Lavrač (Eds.), Relational data mining. Berlin: Springer.

Dietterich, T. (2003, November). Sequential supervised learning: Methods for sequence labeling and segmentation. Invited talk, 3rd IEEE International Conference on Data Mining, Melbourne, FL, USA.

Džeroski, S., & Lavrač, N. (2001). Relational data mining. Berlin: Springer.

Inokuchi, A., Washio, T., & Motoda, H. (2000). An apriori-based algorithm for mining frequent substructures from graph data. In Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery (pp. 13-23), Lyon, France.

Jensen, D., & Neville, J. (2002). Linkage and autocorrelation cause feature selection bias in relational learning. In Proceedings of the 19th International Conference on Machine Learning (pp. 259-266), Sydney, Australia.

Macskassy, S., & Provost, F. (2003). A simple relational classifier. In Proceedings of the 2nd Workshop on Multi-Relational Data Mining at KDD'03, Washington, D.C.

Oyama, T., Kitano, K., Satou, K., & Ito, T. (2002). Extraction of knowledge on protein-protein interaction by association rule discovery. Bioinformatics, 18(8), 705-714.

Tung, A.K.H., Lu, H., Han, J., & Feng, L. (1999). Breaking the barrier of transactions: Mining inter-transaction association rules. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, San Diego, CA.

Yan, X., & Han, J. (2002). gSpan: Graph-based substructure pattern mining. In Proceedings of the International Conference on Data Mining, Maebashi City, Japan.

Yan, X., Han, J., & Afshar, R. (2003). CloSpan: Mining closed sequential patterns in large datasets. In Proceedings of the 2003 SIAM International Conference on Data Mining, San Francisco, CA.

Zaki, M.J. (2000). Generating non-redundant association rules. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (pp. 34-43), Boston, MA.


KEY TERMS

Antecedent: The set of items A in the association rule A ⇒ B.

Apriori: Association rule mining algorithm that uses the fact that the support of a non-empty subset of an item set cannot be smaller than the support of the item set itself.

Association Rule: A rule of the form A ⇒ B, meaning that if the set of items A is present in a transaction, then the set of items B is likely to be present too. A typical example constitutes associations between items purchased at a supermarket.

Confidence: The confidence of a rule is the support of the item set consisting of all items in the rule (A ∪ B) divided by the support of the antecedent.

Entity-Relationship Model (E-R-Model): A model to represent real-world requirements through entities, their attributes, and a variety of relationships between them. E-R-Models can be mapped automatically to the relational model.

Inductive Logic Programming (ILP): Research area at the interface of machine learning and logic programming. Predicate descriptions are derived from examples and background knowledge. All examples, background knowledge and final descriptions are represented as logic programs.

Redundant Association Rule: An association rule is redundant if it can be explained based entirely on one or more other rules.

Relation: A mathematical structure similar to a table in which every row is unique, and neither rows nor columns have a meaningful order.

Relational Database: A database that has relations and relational algebra operations as underlying mathematical concepts. All relational algebra operations result in relations as output. A join operation is used to combine relations. The concept of a relational database was introduced by E. F. Codd at IBM in 1970.

Support: The support of an item set is the fraction of transactions that have all items in that item set.
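To make the support and confidence definitions above concrete, the following is a minimal sketch (not from the original article) that computes both measures for a rule A ⇒ B over a small, invented set of market-basket transactions:

```python
# Hypothetical illustration of the Support and Confidence definitions above.
# The transactions and the rule {bread} => {butter} are invented example data.

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

def support(item_set, transactions):
    """Fraction of transactions containing every item in item_set."""
    hits = sum(1 for t in transactions if item_set <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(A union B) divided by support(A)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"bread"}, transactions))                 # 0.75
print(confidence({"bread"}, {"butter"}, transactions))  # 2/3, about 0.67
```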


Association Rules and Statistics


Martine Cadot
University of Henri Poincaré/LORIA, Nancy, France

Jean-Baptiste Maj
LORIA/INRIA, France

Tarek Ziadé
NUXEO, France

INTRODUCTION

A manager would like to have a dashboard of his company without manipulating data. Usually, statistics have solved this challenge, but nowadays data have changed (Jensen, 1992); their size has increased, and they are badly structured (Han & Kamber, 2001). A recent method, data mining, has been developed to analyze this type of data (Piatetski-Shapiro, 2000). A specific method of data mining, which fits the goal of the manager, is the extraction of association rules (Hand, Mannila & Smyth, 2001). This extraction is a part of attribute-oriented induction (Guyon & Elisseeff, 2003).

The aim of this paper is to compare both types of extracted knowledge: association rules and results of statistics.


BACKGROUND

Statistics have been used by people who want to extract knowledge from data for a century (Freedman, 1997). Statistics can describe, summarize and represent the data. In this paper data are structured in tables, where lines are called objects, subjects or transactions, and columns are called variables, properties or attributes. For a specific variable, the value of an object can have different types: quantitative, ordinal, qualitative or binary. Furthermore, statistics tell if an effect is significant or not; they are then called inferential statistics.

Data mining (Srikant, 2001) has been developed to process a huge amount of data, which is the result of progress in digital data acquisition, storage technology, and computational power. The association rules, which are produced by data-mining methods, express links on database attributes. The knowledge brought by the association rules is shared in two different parts. The first describes general links, and the second finds specific links (knowledge nuggets) (Fabris & Freitas, 1999; Padmanabhan & Tuzhilin, 2000). In this article, only the first part is discussed and compared to statistics. Furthermore, in this article, only data structured in tables are used for association rules.


MAIN THRUST

The problem differs with the number of variables. In the sequel, problems with two, three, or more variables are discussed.

Two Variables

The link between two variables (A and B) depends on the coding. The outcome of statistics is better when data are quantitative. A current model is linear regression. For instance, the salary (S) of a worker can be expressed by the following equation:

S = 100 Y + 20000 + ε    (1)

where Y is the number of years in the company, and ε is a random number. This model means that the salary of a newcomer in the company is $20,000 and increases by $100 per year.

The association rule for this model is: Y ⇒ S. This means that there are only a few senior workers with a small paycheck. For this, the variables are translated into binary variables. Y is no longer the number of years but the property "has seniority," which is not quantitative but of type Yes/No. The same transformation is applied to the salary S, which becomes the property "has a big salary."

Therefore, these two methods both provide the link between the two variables, and each has its own instruments for measuring the quality of the link. For statistics, there are the tests of the regression model (Baillargeon, 1996), and for association rules, there are measures like support, confidence, and so forth (Kodratoff, 2001). But, depending on the type of data, one model is more appropriate than the other (Figure 1).
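As a concrete illustration of this coding step, the sketch below (invented data, not from the article) binarizes the two quantitative variables and then measures the rule Y ⇒ S by its support and confidence; the cut-offs of 10 years and $21,000 are arbitrary assumptions:

```python
# Hypothetical workers: (years in company, salary). Invented example data.
workers = [(2, 20150), (12, 21300), (15, 21500), (20, 22100), (18, 20400)]

# Coding: translate quantitative variables into Yes/No properties.
# The cut-off values below are assumptions chosen for illustration only.
has_seniority  = [y >= 10    for y, s in workers]
has_big_salary = [s >= 21000 for y, s in workers]

n = len(workers)
both = sum(1 for y_bin, s_bin in zip(has_seniority, has_big_salary) if y_bin and s_bin)

support_rule = both / n                      # support of {Y, S}
confidence_rule = both / sum(has_seniority)  # support({Y, S}) / support({Y})

print(f"support = {support_rule:.2f}, confidence = {confidence_rule:.2f}")
# Here confidence = 3/4: one senior worker (18 years, $20,400) breaks the rule.
```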

Figure 1. Coding and analysis methods (the figure places the coding types on a scale from quantitative through ordinal and qualitative to Yes/No; association rules perform better toward the Yes/No end, statistics toward the quantitative end)

Three Variables

If a third variable E, the experience of the worker, is integrated, the equation (1) becomes:

S = 100 Y + 2000 E + 19000 + ε    (2)

E is the property "has experience." If E=1, a new experienced worker gets a salary of $21,000, and if E=0, a new non-experienced worker gets a salary of $19,000. The increase of the salary, as a function of seniority (Y), is the same in both cases of experience.

S = 50 Y + 1500 E + 50 E Y + 19500 + ε    (3)

Now, if E=1, a new experienced worker gets a salary of $21,000, and if E=0, a new non-experienced worker gets a salary of $19,500. The increase of the salary, as a function of seniority (Y), is $50 higher for experienced workers. These regression models belong to the linear model of statistics (Prum, 1996), where, in the equation (3), the third variable has a particular effect on the link between Y and S, called interaction (Winer, Brown & Michels, 1991).

The association rules for this model are:

Y ⇒ S, E ⇒ S for the equation (2)
Y ⇒ S, E ⇒ S, YE ⇒ S for the equation (3)

The statistical test of the regression model allows one to choose with or without interaction, (2) or (3). For the association rules, it is necessary to prune the set of three rules, because their measures do not give the choice between a model of two rules and a model of three rules (Zaki, 2000; Zhu, 1998).
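A minimal sketch of how one might fit models (2) and (3) and inspect the interaction term, using ordinary least squares on synthetic data generated to follow equation (3); the noise level and sample size are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
Y = rng.uniform(0, 30, n)                   # years in the company
E = rng.integers(0, 2, n)                   # has experience (0/1)
eps = rng.normal(0, 100, n)                 # random term epsilon
S = 50*Y + 1500*E + 50*E*Y + 19500 + eps    # salaries following equation (3)

# Design matrices: without interaction (model 2) and with interaction (model 3).
X2 = np.column_stack([Y, E, np.ones(n)])
X3 = np.column_stack([Y, E, E*Y, np.ones(n)])

beta2, rss2, *_ = np.linalg.lstsq(X2, S, rcond=None)
beta3, rss3, *_ = np.linalg.lstsq(X3, S, rcond=None)

print("model (2) coefficients:", np.round(beta2, 1))
print("model (3) coefficients:", np.round(beta3, 1))
# The drop in residual sum of squares from model (2) to model (3) is what an
# F-test on the interaction coefficient would evaluate.
print("RSS without / with interaction:", rss2[0], rss3[0])
```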
More Variables

With more variables, it is difficult to use statistical models to test the link between variables (Megiddo & Srikant, 1998). However, there are still some ways to group variables: clustering, factor analysis, and taxonomy (Govaert, 2003). But the complex links between variables, like interactions, are not given by these models, and this decreases the quality of the results.
Comparison

Table 1 briefly compares statistics with the association rules. Two types of statistics are described: by tests and by taxonomy. Statistical tests are applied to a small number of variables and the taxonomy to a great number of variables. In statistics, the decision is easy to make out of test results, unlike association rules, where a difficult choice among several indices' thresholds has to be performed. For the level of knowledge, the statistical results need more interpretation relative to the taxonomy and the association rules.

Finally, graphs of the regression equations (Hayduk, 1987), taxonomy (Foucart, 1997), and association rules (Gras & Bailleul, 2001) are depicted in Figure 2.
Table 1. Comparison between statistics and association rules

                      Statistics                                  Data Mining
                      Tests         Taxonomy                      Association rules
Decision              Tests (+)     Threshold defined (-)         Threshold defined (-)
Level of Knowledge    Low (-)       High and simple (+)           High and complex (+)
No. of Variables      Small (-)     High (+)                      Small and high (+)
Complex Link          Yes (-)       No (+)                        No (-)

Figure 2. (a) Regression equations; (b) taxonomy; (c) association rules (three example diagrams; only the caption is recoverable here)

FUTURE TRENDS

With association rules, some researchers try to find the right indices and thresholds with stochastic methods. More development needs to be done in this area. Another sensitive problem is that the set of association rules is not made for deductive reasoning. One of the most common solutions is pruning, to suppress redundancies, contradictions and loss of transitivity. Pruning is a new method and needs to be developed.


CONCLUSION

With association rules, the manager can have a fully detailed dashboard of his or her company without manipulating data. The advantage of the set of association rules relative to statistics is a high level of knowledge. This means that the manager does not have the inconvenience of reading tables of numbers and making interpretations. Furthermore, the manager can find knowledge nuggets that are not present in statistics.

The association rules have some inconveniences; however, it is a new method that still needs to be developed.


REFERENCES

Baillargeon, G. (1996). Méthodes statistiques de l'ingénieur: Vol. 2. Trois-Rivières, Quebec: Editions SMG.

Fabris, C., & Freitas, A. (1999). Discovery surprising patterns by detecting occurrences of Simpson's paradox: Research and development in intelligent systems XVI. Proceedings of the 19th Conference of Knowledge-Based Systems and Applied Artificial Intelligence, Cambridge, UK.

Foucart, T. (1997). L'analyse des données, mode d'emploi. Rennes, France: Presses Universitaires de Rennes.

Freedman, D. (1997). Statistics. New York: W.W. Norton & Company.

Govaert, G. (2003). Analyse de données. Lavoisier, France: Hermès-Science.

Gras, R., & Bailleul, M. (2001). La fouille dans les données par la méthode d'analyse statistique implicative. Colloque de Caen. Ecole polytechnique de l'Université de Nantes, Nantes, France.

Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection: Special issue on variable and feature selection. Journal of Machine Learning Research, 3, 1157-1182.

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco, CA: Morgan Kaufmann.

Hand, D., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge, MA: MIT Press.

Hayduk, L.A. (1987). Structural equation modelling with LISREL. Maryland: Johns Hopkins Press.

Jensen, D. (1992). Induction with randomization testing: Decision-oriented analysis of large data sets [doctoral thesis]. Washington University, Saint Louis, MO.

Kodratoff, Y. (2001). Rating the interest of rules induced from data and within texts. Proceedings of the 12th IEEE International Conference on Database and Expert Systems Applications (DEXA), Munich, Germany.

Megiddo, N., & Srikant, R. (1998). Discovering predictive association rules. Proceedings of the Conference on Knowledge Discovery in Data, New York.

Padmanabhan, B., & Tuzhilin, A. (2000). Small is beautiful: Discovering the minimal set of unexpected patterns. Proceedings of the Conference on Knowledge Discovery in Data, Boston, Massachusetts.

Piatetski-Shapiro, G. (2000). Knowledge discovery in databases: 10 years after. Proceedings of the Conference on Knowledge Discovery in Data, Boston, Massachusetts.

Prum, B. (1996). Modèle linéaire: Comparaison de groupes et régression. Paris, France: INSERM.

Srikant, R. (2001). Association rules: Past, present, future. Proceedings of the Workshop on Concept Lattice-Based Theory, Methods and Tools for Knowledge Discovery in Databases, California.

Winer, B.J., Brown, D.R., & Michels, K.M. (1991). Statistical principles in experimental design. New York: McGraw-Hill.

Zaki, M.J. (2000). Generating non-redundant association rules. Proceedings of the Conference on Knowledge Discovery in Data, Boston, Massachusetts.

Zhu, H. (1998). On-line analytical mining of association rules [doctoral thesis]. Simon Fraser University, Burnaby, Canada.


KEY TERMS

Attribute-Oriented Induction: Association rules, classification rules, and characterization rules are written with attributes (i.e., variables). These rules are obtained from data by induction and not from theory by deduction.

Badly Structured Data: Data, like texts of a corpus or log sessions, often do not contain explicit variables. To extract association rules, it is necessary to create variables (e.g., keyword) after defining their values (frequency of appearance in corpus texts, or simply appearance/non-appearance).

Interaction: Two variables, A and B, are in interaction if their actions are not separate.

Linear Model: A variable is fitted by a linear combination of other variables and interactions between them.

Pruning: The algorithms of extraction for association rules are optimized in computational cost but not in other constraints. This is why a suppression has to be performed on the results that do not satisfy special constraints.

Structural Equations: System of several regression equations with numerous possibilities. For instance, the same variable can enter different equations, and a latent (not defined in data) variable can be accepted.

Taxonomy: This belongs to clustering methods and is usually represented by a tree. Often used in life categorization.

Tests of Regression Model: Regression models and analysis of variance models have numerous hypotheses, e.g., normal distribution of errors. These constraints allow one to determine if a coefficient of the regression equation can be considered as null at a fixed level of significance.


Automated Anomaly Detection


Brad Morantz
Georgia State University, USA

INTRODUCTION

Preparing a dataset is a very important step in data mining. If the input to the process contains problems, noise, or errors, then the results will reflect this as well. Not all possible combinations of the data should exist, as the data represent real-world observations. Correlation is expected among the variables. If all possible combinations were represented, then there would be no knowledge to be gained from the mining process.

The goal of anomaly detection is to identify and/or remove questionable or incorrect observations. These occur because of keyboard error, measurement or recording error, human mistakes, or other causes. Using knowledge about the data, some standard statistical techniques, and a little programming, a simple data-scrubbing program can be written that identifies or removes faulty records. Duplicates can be eliminated, because they contribute no new knowledge. Real valued variables could be within measurement error or tolerance of each other, yet each could represent a unique rule. Statistically categorizing the data would eliminate or, at least, greatly reduce this.

In applications of this process with actual datasets, accuracy has been increased significantly, in some cases doubled or more.


BACKGROUND

Data mining is an exploratory process looking for as yet unknown patterns (Westphal & Blaxton, 1998). The data represent real-world occurrences, and there is correlation among the variables. Some are principled in their construction, one event triggering another. Sometimes events occur in a certain order (Westphal & Blaxton, 1998). Not all possible combinations of the data are to be expected. If this were not the case, then we would learn nothing from the data. These methods allow us to see patterns and regularities in large datasets (Mitchell, 1999).

Credit reporting agencies have been examining large datasets of credit histories for quite some time, trying to determine rules that will help discern between problematic and responsible consumers (Mitchell, 1999). Datasets have been mined looking for indications of everything from boiler explosion probabilities to high-risk pregnancies to consumer purchasing patterns. This is the semiotics of data, as we transform data to information and finally to knowledge.

Dirty data, or data containing errors, are a major problem in this process. The old saying is, "garbage in, garbage out" (Statsoft, 2004). Heuristic estimates are that 60-80% of the effort should go into preparing the data for mining, and only the small remaining portion actually is required for the data-mining effort itself. Data records that are deviations from the common rule are called anomalies.

Data are always dirty and have been called the curse of data mining (Berry & Linoff, 2000). Several factors can be responsible for attenuating the quality of the data, among them errors, missing values, and outliers (Webb, 2002). Missing data have many causes, varying from recording error to illegible writing to values that were simply not supplied. This is closely related to incorrect values, which can also be caused by poor penmanship as well as measurement error, keypunch mistakes, different or incorrect metrics, a misplaced decimal, and other similar causes.

Fuzzy definitions, where the meaning of a value is either unclear or inconsistent, are another problem (Berry & Linoff, 2000). Often, when something is being measured and recorded, mistakes happen. Even automated processes can produce dirty data (Bloom, 1998). Micro-array data have errors due to base pairs on the probe not matching correctly to genes in the test material (Shavlik et al., 2004). The sources of error are many, and it is necessary to have a process that finds these anomalies and identifies them.

In real valued datasets, the possible combinations are (almost) unlimited. A dataset with eight variables, each with four significant digits, could yield as many as 10^32 combinations. Mining such a dataset would not only be tedious and time-consuming, but possibly could yield an overly large number of patterns. Using (six-range) categorical data, the same problem would only have 1.67 x 10^6 combinations. Gauss normally distributed data can be separated into plus or minus 1, 2, or 3 sigma. Other distributions can use Chebyshev or other distributions with similar dividing points. There is no real loss of data, yet the process is greatly simplified.

Finding the potentially bad observations or records is the first problem. The second problem is what to do once they are found. In many cases it is possible to go back and verify the value, correcting it, if necessary.
If this is possible, the program should flag values that are to be verified. This may not always be possible, or it may be too expensive. Not all situations repeat within a reasonable time, if at all (e.g., observation of Halley's comet).

There are two schools of thought, the first being to substitute the mean value for the missing or wrong value. The problem with this is that it might not be a reasonable value, and it can create a new rule, one that could be false (e.g., the shoe size for a giant is not average). It might introduce sample bias as well (Berry & Linoff, 2000).

Deleting the observation is the other common solution. Quite often, in large datasets, a duplicate exists, so deleting causes no loss. The cost of improper commission is greater than that of omission. But sometimes an outlier tells a story, so one has to be careful about deletions.
THE AUTOMATED ANOMALY DETECTION PROCESS

Methodology

To illustrate the process, a public dataset is used. This particular one is available from the University of California at Irvine Machine Learning Repository (University of California, 2003). Known as the Abalone dataset, it consists of 4,400 observations of abalones that were captured in the wild, with several measurements of each one. Natural variation exists, as well as human error, both in making the measurements and in the recording. Also listed on the Web site were some studies that used the data and their results. Accuracy in the form of hit rate varied between 0-35%.

While it may seem overly simple and obvious, plotting the data is the first step. These graphical views can provide much insight into the data (Webb, 2002). The data for each variable can be plotted vs. frequency of occurrence to visually determine the distribution. Combining this with knowledge of the research will help to determine the correct distribution to use for each included variable. A sum of independent terms would tend to support a Gauss normal distribution, while the product of a number of independent terms might suggest using log normal. This plotting also might suggest necessary transformations.

It is necessary to understand the acceptable range for each field. Some values obtained might not be reasonable. If there is a zero in a field, is it indicative of a missing value, or is it an acceptable value? No value is not the same as zero. Some values, while within bounds, might not be possible. It is also necessary to check for obvious mistakes, inconsistencies, or out-of-bounds values.

Knowledge about the subject of study is necessary. From this, rules can be made. In the example of the abalone, the animal in the shell must weigh more than when it is shucked (removed from the shell), for obvious reasons. Other such rules from domain knowledge can be created (abalone.net, 2004; University of Capetown, 2004; World Aquaculture, 2004). Sometimes they may seem too obvious, but they are effective. The rules can be programmed into a subroutine specific to the dataset.

Regression can be used to check for variables that are not statistically significant. Step-wise regression is a handy tool for identifying significant variables. Other ratio variables can be created and then checked for significance using regression. Again, domain knowledge can help create these variables, as well as insight and some luck. Insignificant variables can be deleted from the dataset, and new ones can be added.

If the dataset is real valued, it is possible that records exist that are within tolerance or measurement error of each other. There are two ways to reduce the number of unique observations. (1) Attenuate the accuracy by rounding to reduce the number of significant digits. Each variable rounded to one less significant digit reduces the number of possible patterns by an order of magnitude. (2) Calculate a mean and standard deviation for the cleaned dataset. Using an appropriate distribution, sort the values by standard deviations from the mean. Testing to see if the chosen distribution is correct is accomplished by using a Chi-square test, a Kolmogorov-Smirnov test, or the empirical test. The number of standard deviations replaces the real valued data, and a simple categorical dataset will exist. This allows for simple comparisons between observations. Otherwise, records with values differing by as little as .0001% would be considered unique and different. While some of the precision of the original data is lost, this process is exploratory and finds the general patterns that are in the data. This allows one to gain insight into the database using a combination of statistics and artificial intelligence (Pazzani, 2000), using human knowledge and skill as the catalyst to improve the results.

The final step before mining the data is to remove duplicates, as they add no additional information. As the collection of observations gets increasingly larger, it gets harder to introduce new experiences. This process can be incorporated into the computer program by a simple procedure that is similar to bubblesort. Instead of comparing to see which row is greater, it just looks for differences. If none are found, then the row is deleted.
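A minimal sketch of the scrubbing steps just described (a domain-rule check, conversion to standard-deviation categories, and duplicate removal), written against hypothetical column names rather than the actual Abalone file layout:

```python
import numpy as np

def scrub(rows):
    """rows: list of dicts with hypothetical keys 'whole_wt', 'shucked_wt', 'height'."""
    # 1. Domain rules: drop impossible records (whole animal must outweigh shucked meat).
    kept = [r for r in rows if r["whole_wt"] > r["shucked_wt"] and r["height"] > 0]

    # 2. Replace each real value by its integer number of standard deviations
    #    from the mean, producing a simple categorical dataset.
    cols = ["whole_wt", "shucked_wt", "height"]
    data = np.array([[r[c] for c in cols] for r in kept])
    z = (data - data.mean(axis=0)) / data.std(axis=0)
    categorical = np.round(z).astype(int)

    # 3. Remove duplicate rows, which add no new information.
    return np.unique(categorical, axis=0)

rows = [  # invented measurements for illustration
    {"whole_wt": 0.52, "shucked_wt": 0.22, "height": 0.14},
    {"whole_wt": 0.51, "shucked_wt": 0.21, "height": 0.12},
    {"whole_wt": 0.20, "shucked_wt": 0.35, "height": 0.11},  # fails the domain rule
]
print(scrub(rows))
```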


Example Results

A few variables were plotted, producing some very unusual graphs. These were definitely not the graphs that were expected. This was the first indication that the dataset was noisy. Abalones are born in very large numbers, but with an extremely high infant mortality rate (over 99%) (Bamfield Marine Science Centre, 2004). This graph did not reflect that.

An initial scan of the data showed some inconsistent points, like a five-year-old infant, a shucked animal weighing more than a complete one, and other similar abnormalities. Another problem with most analyses of these datasets is that gender is not ratio or ordinal data and, therefore, had to be converted to a dummy variable.

Step-wise regression removed all but five variables. The remaining variables were: diameter, height, whole weight, shucked weight, and viscera weight. Two new variables were created: shell ratio (whole weight divided by shell weight) and weight to diameter ratio. Since the diameter is directly proportional to volume, this second variable is proportional to density. The proof of its significance was a t value of 39 and an F value of 1561; these are both statistically significant. A plot of shell ratio vs. frequency yielded a fairly Gauss normal looking curve.

As these are real valued data with four digits given, it is possible to have observations that vary by as little as 0.01%. This value is even less than the accuracy of the measuring instruments. In other words, there are really a relatively small number of possibilities, described by a large number of almost identical examples, some within measurement tolerance of each other.

The mean and standard deviation were calculated for each of the remaining and new variables of the dataset. The empirical test was done to verify approximate fit to a Gauss normal distribution. Each value then was replaced by the integer number of standard deviations it is from the mean, creating a categorical dataset. Simple visual inspection showed two things: (1) there was, indeed, correlation among the observations; and (2) it became increasingly more difficult to introduce a new pattern.

The duplicate removal process was the next step. As expected, the first 50 observations had only 22% duplicates, but by the time the entire dataset was processed, 65% of the records had been removed, because they presented no new information.

To better understand the quality of the data, least squares regression was performed. The model produced an ANOVA F value of 22.4, showing good confidence in it. But the Pearsonian correlation coefficient R² of only 0.25 indicated that there was some problem. Visual observation of the dataset and its plots led to some suspicion of the group with one ring (age = 2.5 years). OLS regression was performed on this group, yielding an F of 27, but an R² of only 0.03. This tells us that this portion of the data is only muddying the water and attenuating the performance of our model.

Upon removal of this group of observations, OLS regression was performed on the remaining data, giving an improved F of 639 (showing that, indeed, it is a good model) and an R² of 0.53, an acceptable level and one that can adequately describe the variation in the criterion.

The results listed at the Web site where the dataset was obtained are as follows. Sam Waugh in the Computer Science Department at the University of Tasmania used this dataset in 1995 for his doctoral dissertation (University of California, 2003). His results, while the first recorded attempt, did not have good accuracy at predicting the age. The problem was encoded as a classification task.

24.86%  Cascade Correlation (no hidden nodes)
26.25%  Cascade Correlation (five hidden nodes)
21.5%   C4.5
0.0%    Linear Discriminant Analysis
3.57%   k=5 Nearest Neighbor

Clark, et al. (1996) did further work on this dataset. They split the ring classification into three groups: 1 to 8, 9 to 10, and 11 and up. This reduced the number of targets and made each one bigger, in effect making each easier to hit. Their results were much better, as shown in the following:

64%  Back propagation
55%  Dystal

The results obtained from the answer tree using the new cleaned dataset are shown in Table 1. All of the one-ring observations were filtered out in a previous step, and the extraction was 100% accurate in not predicting any as being one-ring.

Table 1. Hit rate (rows: predicted category; columns: actual category)

Predicted      1      2      3      4   Total
    1          0      0      0      0       0
    2         11    269     79      2     361
    3        231    140    953    280    1604
    4         22      3     27     81     133
  Total      264    412   1059    363    2098


The hit rates are as follows:

1 ring              100.0% correct
2 ring               74.5% correct
3 ring               59.4% correct
4 ring               60.9% correct
Overall accuracy     62.1% correct
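These figures appear to follow directly from Table 1: each class rate is the diagonal entry divided by its predicted-row total, and the overall accuracy is the diagonal sum over all 2,098 cases. A small sketch of that arithmetic:

```python
import numpy as np

# Confusion matrix from Table 1 (rows: predicted category 1-4, columns: actual).
m = np.array([[  0,   0,   0,   0],
              [ 11, 269,  79,   2],
              [231, 140, 953, 280],
              [ 22,   3,  27,  81]])

row_totals = m.sum(axis=1)
diag = np.diag(m)

for k in range(4):
    # Nothing was predicted as one-ring, which the article counts as 100% correct.
    rate = 100.0 if row_totals[k] == 0 else 100 * diag[k] / row_totals[k]
    print(f"{k+1} ring: {rate:.1f}% correct")

print(f"Overall accuracy: {100 * diag.sum() / m.sum():.1f}% correct")
# 2 ring: 74.5%, 3 ring: 59.4%, 4 ring: 60.9%, overall: 62.1%
```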
FUTURE TRENDS

A program could be written that would input simple things like the number of variables, the number of observations, and some classification results. A rule input mechanism would accept the domain-specific rules and make them part of the analysis. Further improvements would be the inclusion of fuzzy logic. Type I would allow the use of lingual variables (i.e., big, small, hot, cold) in the records, and type II would allow for some fuzzy overlap and fit.


CONCLUSION

Data mining is an exploratory process to see what is in the data and what patterns can be found. Noise and errors in the dataset are reflected in the results from the mining process. Cleaning the data and identifying anomalies should be performed. Marked observations should be verified and corrected, if possible. If this cannot be done, they should be deleted. In real valued datasets, the values can be categorized with accepted statistical techniques. Anomaly detection, after some manual viewing and analysis, can be automated. Part of the process is specific to the knowledge domain of the dataset, and part could be standardized. In our example problem, this cleaning process improved results, and the mining produced a more accurate rule set.
REFERENCES

Abalone.net. (2003). All about abalone: An online guide. Retrieved from http://www.abalone.net

Bamfield Marine Sciences Centre Public Education Programme. (2004). Oceanlink. Retrieved from http://oceanlink.island.net/oinfo/Abalone/abalone.html

Berry, M.J.A., & Linoff, G.S. (2000). Mastering data mining: The art and science of customer relationship management. New York, NY: Wiley & Sons, Inc.

Bloom, D. (1998). Technology, experimentation, and the quality of survey data. Science, 280(5365), 847-848.

Clark, D., Schreter, Z., & Adams, A. (1996). A quantitative comparison of dystal and backpropagation. Proceedings of the Australian Conference on Neural Networks (ACNN 96), Canberra, Australia.

Mitchell, T. (1999). Machine learning and data mining. Communications of the ACM, 42(11), 30-36.

Pazzani, M.J. (2000). Knowledge discovery from data? IEEE Intelligent Systems, 15(2), 10-13.

Shavlik, J., Molla, M., Waddell, M., & Page, D. (2004). Using machine learning to design and interpret gene-expression microarrays. American Association for Artificial Intelligence, 25(1), 23-44.

Statsoft, Inc. (2004). Electronic statistics textbook. Retrieved from http://www.statsoft.com

Tasmanian Abalone Council Ltd. (2004). http://tasabalone.com.au

University of California at Irvine. (2003). Machine learning repository, abalone database. Retrieved from http://ics.uci.edu/~mlearn/MLRepository

University of Capetown Zoology Department. (2004). http://web.uct.ac.za/depts/zoology/abnet

Webb, A. (2002). Statistical pattern recognition. West Sussex, England: Wiley & Sons.

Westphal, C., & Blaxton, T. (1998). Data mining solutions: Methods and tools for solving real-world problems. New York: Wiley & Sons.

World Aquaculture. (2004). http://www7.taosnet.com/platinum/data/light/species/abalone.html


KEY TERMS

Anomaly: A value or observation that deviates from the rule or analogy; a potentially incorrect value.

ANOVA or Analysis of Variance: A powerful statistical method for studying the relationship between a response or criterion variable and a set of one or more predictor or independent variable(s).

Correlation: The amount of relationship between two variables; how they change relative to each other. Range: -1 to +1.

F Value: Fisher value, a statistical distribution, used here to indicate the probability that an ANOVA model is good. In the ANOVA calculations, it is the ratio of squared variances. A large number translates to confidence in the model.

Ordinal Data: Data that is in order but has no relationship between the values or to an external value.

Pearsonian Correlation Coefficient: Defines how much of the variation in the criterion variable(s) is explained by the model. Range: 0 to 1.

Ratio Data: Data that is in order and has fixed spacing; a relationship between the points that is relative to a fixed external point.

Step-Wise Regression: An automated procedure in statistical programs that adds one predictor variable at a time and, if it is not statistically significant, removes it from the model. Some work in both directions, either adding to or removing from the model, one variable at a time.

T Value, also called Student's t: A statistical distribution for smaller sample sizes. In regression routines in statistical programs, it indicates whether a predictor variable is statistically significant or if it truly is contributing to the model. A value of more than about 3 is required for this indication.


Automatic Musical Instrument Sound Classification
Alicja A. Wieczorkowska
Polish-Japanese Institute of Information Technology, Poland

INTRODUCTION

The aim of musical instrument sound classification is to process information from audio files by a classificatory system and accurately identify the musical instruments playing the processed sounds. This operation and its results are called automatic classification of musical instrument sounds.


BACKGROUND

Musical instruments are grouped into the following categories (Hornbostel & Sachs, 1914):

•  Idiophones: Made of solid, non-stretchable, sonorous material.
•  Membranophones: Skin drums; membranophones and idiophones are called percussion.
•  Chordophones: Stringed instruments.
•  Aerophones: Wind instruments: woodwinds (single-reed, double-reed, flutes) and brass (lip-vibrated).

Idiophones are classified according to the material, the number of idiophones and resonators in a single instrument, and whether pitch or tuning is important. Subcategories include idiophones struck together by concussion (e.g., castanets), struck (gong), rubbed (musical glasses), scraped (washboards), stamped (hard floors stamped with tap shoes), shaken (rattles), and plucked (jew's harp).

Membranophones are classified according to their shape, material, number of heads, whether they have snares, whether and how the drum is tuned, and how the skin is fixed and played. Subcategories include drums (cylindrical, conical, barrel, hourglass, goblet, footed, long, kettle, frame, friction drum, and mirliton/kazoo).

Chordophones are classified with respect to the relationship of the strings to the body of the instrument, whether they have frets (low bridges on the neck or body, where strings are stopped) or movable bridges, the number of strings, and how they are played and tuned. Subcategories include zither, lute plucked and bowed (e.g., guitars, violin), harp, lyre, and bow.

Aerophones are classified according to how the air is set in motion, mainly depending on the mouthpiece: blow hole, whistle, reed, and lip-vibrated. Subcategories include flutes (end-blown, side-blown, nose, globular, multiple), panpipes, whistle mouthpiece (recorder), single- and double-reed (clarinet, oboe), air chamber (pipe organs), lip-vibrated (trumpet or horn), and free aerophone (bullroarers) (SIL, 1999).

The description of the properties of musical instrument sounds is usually given in vague subjective terms, like sharp, nasal, bright, and so forth, and only some of them (e.g., brightness) have numerical counterparts. Therefore, one of the main problems in this research is to prepare an appropriate numerical sound description for instrument recognition purposes.

Automatic classification of musical instrument sounds aims at classifying audio data accurately into appropriate groups representing instruments. This classification can be performed at instrument level, at instrument family level (e.g., brass), or by articulation (i.e., how sound is struck, sustained, and released, e.g., vibrato: varying the pitch of a note up and down) (Smith, 2000). As a preprocessing step, the audio data are usually parameterized (i.e., numerical or other parameters or attributes are assigned), and then data mining techniques are applied to the parameterized data. Accuracy of classification varies, depending on the audio data used in the experiments, the number of instruments, the parameterization, the classification, and the validation procedure applied. Automatic classification compares favorably with human performance. Listeners identify musical instruments with accuracy far from perfect, with results depending on the sounds chosen and the experience of the listeners. Classification systems allow instrument identification without the participation of human experts. Therefore, such systems can be valuable assistance for users of audio data searching for a specific timbre, especially if they are not experienced musicians and when the amount of available audio data is huge, thus making manual searching impractical, if possible at all. When combined with a melody-searching system, automatic instrument classification may provide a handy tool for finding favorite tunes performed by favorite instruments in audio databases.

MAIN THRUST

Research on automatic classification of musical instruments has so far been performed mainly on isolated, singular sounds; works on polyphonic sounds usually aim at source separation and operations like pitch tracking of these sounds (Viste & Evangelista, 2003). The most commonly used data include the MUMS compact discs (Opolko & Wapnick, 1987), the University of Iowa samples (Fritts, 1997), and IRCAM's Studio on Line (IRCAM, 2003).

A broad range of data mining techniques has been applied in this research, aiming at extraction of information hidden in audio data (i.e., sound features that are common for a given instrument and differentiate it from the others). Descriptions of musical instrument sounds are usually subjective, and finding appropriate numeric descriptors (parameters) is a challenging task. Sound parameterization is arbitrarily chosen by the researchers, and the parameters may reflect features that are known to be important for a human in the instrument recognition task, like descriptors of sound evolution in time (i.e., onset features, depth of sound vibration, etc.), subjective timbre features (i.e., brightness of the sound), and so forth. Basically, parameters characterize coefficients of sound analysis, since they are relatively easy to calculate.

On the basis of parameterization, further research can be performed. Clustering applied to parameter vectors reveals similarity among sounds and adds a new glance on instrument classification, usually based on instrument construction or sound articulation. Decision rules and trees allow identification of the most descriptive sound features. Transformation of sound parameters may produce new descriptors better suited for automatic instrument classification. The classification can be performed hierarchically, taking instrument families and articulation into account. Classifiers represent a broad range of methods, from simple statistic tools to new advanced algorithms rooted in artificial intelligence.

Parameterization Methods

Sound is a physical disturbance in the medium (e.g., air) through which it is propagated (Whitaker & Benson, 2002). Periodic fluctuations are perceived as sound having pitch. The audible frequency range is about 20-20,000 Hz (hertz, or cycles per second). The parameterization aims at capturing the most distinctive sound features regarding sound amplitude evolution in time, static spectral features (frequency contents) of the most stable part of the sound, and the evolution of frequency content in time. These features are based on the Fourier spectrum and time-frequency sound representations like the wavelet transform. Some analyses are adjusted to the properties of human hearing, which perceives changes of sound amplitude and frequency in a logarithmic-like manner (e.g., frequency contents analysis in mel scale). The results based on such analysis are easier to interpret in subjective terms. Also, statistic and mathematic operations are applied to the sound representation, yielding good results, too. Some descriptors require calculating the pitch of the sound, and any inaccuracies in pitch calculation (e.g., octave errors) may lead to erroneous results.

Parameter sets investigated in the research are usually a mixture of various types, since such combinations allow capturing a more representative sound description for instrument classification purposes.

The following analysis and parameterization methods are used to describe musical instrument sounds:

•  Autocorrelation and cross-correlation functions investigating periodicity of the signal, and statistical parameters of the spectrum obtained via Fourier transform: average amplitude and frequency variations (wide in vibrated sounds), standard deviations (Ando & Yamaguchi, 1993).
•  Contents of selected groups of partials in the spectrum (Pollard & Jansson, 1982; Wieczorkowska, 1999a), including the amount of even and odd harmonics (Martin & Kim, 1998), allowing identification of clarinet sounds.
•  Vibrato strength and other changes of sound features in time (Martin & Kim, 1998; Wieczorkowska et al., 2003), and the temporal envelope of the sound.
•  Statistical moments of the time wave, spectral centroid (gravity center), coefficients of the cepstrum (i.e., the Fourier transform applied to the logarithm of the amplitude plot of the spectrum), constant-Q coefficients (i.e., for logarithmically-spaced spectral bins) (Brown, 1999; Brown et al., 2001).
•  Wavelet analysis, providing a time-frequency plot based on decomposition of the sound signal into functions called wavelets (Kostek & Czyzewski, 2001; Wieczorkowska, 1999b).
•  Mel-frequency coefficients (i.e., in mel scale, adjusted to the properties of human hearing) and linear prediction cepstral coefficients, where future values are estimated as a linear function of previous values (Eronen, 2001).
•  Multidimensional Scaling Analysis (MSA) trajectories obtained through Principal Component Analysis (PCA) applied to the constant-Q spectral snapshots to determine the most significant attributes of each sound (Kaminskyj, 2002). PCA transforms a set of variables into a smaller set of uncorrelated variables, which keep as much of the variability in the data as possible.


•  MPEG-7 audio descriptors, including log-attack time (i.e., the logarithm of onset duration), fundamental frequency (pitch), spectral envelope and spread, etc. (ISO, 2003; Peeters et al., 2000).
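To give one of these descriptors a concrete form, the following is a minimal sketch (not taken from any of the cited systems) of computing the spectral centroid, the "gravity center" of the spectrum mentioned in the list above, for a single frame of a sampled signal:

```python
import numpy as np

def spectral_centroid(frame, sample_rate):
    """Gravity center of the magnitude spectrum, in Hz."""
    magnitudes = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    return (freqs * magnitudes).sum() / magnitudes.sum()

# Synthetic test tone: 440 Hz fundamental plus a weaker 880 Hz partial.
sr = 44100
t = np.arange(2048) / sr
frame = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 880 * t)
print(f"{spectral_centroid(frame, sr):.0f} Hz")  # lies between the two partials
```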
Feature vectors obtained via parameterization of musical instrument sounds are used as inputs for classifiers, both for training and for recognition purposes.

Classification Techniques

Automatic classification is the process by which a classificatory system processes information in order to automatically classify data accurately, or the result of such a process. A class may represent an instrument, an articulation, an instrument family, and so forth. Classifiers applied to this task range from probabilistic and statistical algorithms, through methods based on learning by example, where classification is based on the distance between the observed sample and the nearest known neighbor, to methods originating from artificial intelligence, like neural networks, which mimic neural connections in the brain. Each classifier yields a new sound description (representation). Some classifiers produce an explicit set of classification rules (e.g., decision trees or rough-set-based algorithms), giving insight into relationships between specific sound timbres and the calculated features. Since human-performed recognition of musical instruments is based on subjective criteria and is difficult to formalize, learning algorithms that allow extraction of precise rules of sound classification broaden our knowledge and give a formal representation of subjective sound features.

The following algorithms can be applied to musical instrument sound classification:

•  Bayes decision rule (i.e., a probabilistic method of assigning unknown samples to classes). In Brown (1999), training data were grouped into clusters obtained through the k-means algorithm, and Gaussian probability density functions were formed from the mean and variance of each cluster.
•  K-Nearest Neighbor (k-NN) algorithm, where the class (instrument) for a tested sound sample is assigned on the basis of the distances between the vector of parameters for this sample and the majority of the k nearest vectors representing known samples (Kaminskyj, 2002; Martin & Kim, 1998). To improve performance, genetic algorithms are additionally applied to find the optimal set of weights for the parameters (Fujinaga & McMillan, 2000).
•  A statistical pattern-recognition technique: a maximum a posteriori classifier based on Gaussian models (introducing prior probabilities), obtained via Fisher multiple discriminant analysis, which projects the high-dimensional feature space into a space of one dimension fewer than the number of classes, in which the classes are separated maximally (Martin & Kim, 1998).
•  Neural networks, designed by analogy with a simplified model of the neural connections in the brain and trained to find relationships in the data; multi-layer nets and self-organizing feature maps have been used (Cosi et al., 1994; Kostek & Czyzewski, 2001).
•  Decision trees, where nodes are labeled with sound parameters, edges are labeled with parameter values, and leaves represent classes (Wieczorkowska, 1999b).
•  Rough-set-based algorithms; rough sets are defined by a lower approximation, containing elements that belong to the set for sure, and an upper approximation, containing elements that may belong to the set (Wieczorkowska, 1999a).
•  Support vector machines, which aim at finding the hyperplane that best separates observations belonging to different classes (Agostini et al., 2003).
•  Hidden Markov Models (HMM), used for representing sequences of states; in this case, they can be used for representing long sequences of feature vectors that define an instrument sound (Herrera et al., 2000).

Classifiers are first trained and then tested with respect to their generalization abilities (i.e., whether they work properly on unknown samples).

Validation and Results

Parameterization and classification methods yield various results, depending on the sound data and the validation procedure when classifiers are tested on unseen samples. Usually, the available data are divided into training and test sets. For instance, 70% of the data is used for training and the remaining 30% for testing; this procedure is usually repeated a number of times, and the final result is the average of all runs. Other popular divisions are in the proportions 80/20 or 90/10. Also, the leave-one-out procedure is used, where only one sample is used for testing. Generally, the higher the percentage of the training data in proportion to the test data, and the smaller the number of classes, the higher the accuracy that is obtained. Some instruments are easily identified with high accuracy, whereas others frequently are misclassified, especially with those from the same family. Classification of instruments is sometimes performed hierarchically: articulation or family is recognized first, and then the instrument is identified.
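A minimal sketch of the hold-out validation scheme described above, paired with a bare-bones k-NN classifier over random synthetic feature vectors (real experiments would of course use parameterized instrument sounds):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for parameterized sounds: 90 feature vectors, 3 "instruments".
X = rng.normal(size=(90, 8)) + np.repeat(np.arange(3), 30)[:, None]
y = np.repeat(np.arange(3), 30)

# 70/30 split into training and test sets.
idx = rng.permutation(len(X))
train, test = idx[:63], idx[63:]

def knn_predict(x, k=5):
    """Majority vote among the k nearest training vectors (Euclidean distance)."""
    d = np.linalg.norm(X[train] - x, axis=1)
    votes = y[train][np.argsort(d)[:k]]
    return np.bincount(votes).argmax()

hits = sum(knn_predict(X[i]) == y[i] for i in test)
print(f"hit rate: {hits / len(test):.0%}")
```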


Following is an overview of results obtained so far in the research on musical instrument sound classification:

•  Brown (1999) reported an average 84.1% recognition accuracy for two classes, oboe and saxophone, using cepstral coefficients as features and Bayes decision rules for clusters obtained via the k-means algorithm.
•  Brown, Houix, and McAdams (2001), in experiments with four classes, obtained 79-84% accuracy for bin-to-bin differences of constant-Q coefficients, and cepstral and autocorrelation coefficients, using a Bayesian method.
•  K-NN classification applied to mel-frequency and linear prediction cepstral coefficients (Eronen, 2001), with training on 29 orchestral instruments and testing on 16 instruments from various recordings, yielded 35% accuracy for instruments and 77% for families. K-NN combined with genetic algorithms (Fujinaga & McMillan, 2000) yielded 50% correctness in leave-one-out tests on spectral features representing 23 orchestral instruments played with various articulations.
•  Kaminskyj (2002) applied k-NN to constant-Q and cepstral coefficients, MSA trajectories, amplitude envelope, and spectral centroid. He obtained 89-92% accuracy for instruments, 96% for families, and 100% in identifying impulsive vs. sustained sounds in leave-one-out tests for MUMS data. Tests on other recordings initially yielded 33-61% accuracy, and 87-90% after improvements.
•  Multilayer neural networks applied to wavelet and Fourier-based parameterization yielded 72-99% accuracy for various groups of four instruments (Kostek & Czyzewski, 2001).
•  The statistical pattern-recognition technique and the k-NN algorithm, applied to sounds representing 14 orchestral instruments played with various articulations, yielded 71.6% accuracy for instruments, 86.9% for families, and 98.8% in discriminating continuant sounds vs. pizzicato (Martin & Kim, 1998) in 70/30 tests. The features included pitch, spectral centroid, ratio of odd-to-even harmonic energy, onset asynchrony, and the strength of vibrato and tremolo (quick changes of sound amplitude or note repetitions).
•  Discriminant analysis and support vector machines yielded about 70% accuracy in leave-one-out tests with spectral features for 27 instruments (Agostini et al., 2003).
•  Rough-set-based classifiers and decision trees applied to data representing 18 classes (11 orchestral instruments, various articulations), parameterized using Fourier and wavelet-based attributes, yielded 68-77% accuracy in 90/10 tests, and 64-68% in 70/30 tests (Wieczorkowska, 1999b).
•  K-NN and rough-set-based classifiers, applied to spectral and temporal sound parameterization, yielded 68% accuracy in 80/20 tests for 18 classes, representing 11 orchestral instruments and various articulations (Wieczorkowska et al., 2003).

Generally, instrument families or sustained/impulsive sounds are identified with accuracy exceeding 90%, whereas instruments, if there are more than 10, are identified with accuracy reaching about 70%. These results compare favorably with human performance and exceed the results obtained for inexperienced listeners.


FUTURE TRENDS

Automatic indexing and searching of audio files is gaining increasing interest. The MPEG-7 standard addresses the issue of content description in multimedia data, and the audio descriptors provided in this standard form a basis for further research. Constant growth of the audio resources available on the Internet causes an increasing need for content-based search of audio data. Therefore, we can expect intensification of research in this domain and progress in studies on automatic classification of musical instrument sounds.


CONCLUSION

Results obtained so far in automatic musical instrument sound classification vary, depending on the size of the data, the sound parameterization, the classifier, and the testing method. Also, some instruments are identified easily with high accuracy, whereas others are misclassified frequently, in the case of both human and machine performance.

Increasing interest in content-based searching through audiovisual data and the growing amount of multimedia data available via the Internet raise the need and perspective for further progress in automatic classification of audio data.


REFERENCES

Agostini, G., Longari, M., & Pollastri, E. (2003). Musical instrument timbres classification with spectral features. EURASIP Journal on Applied Signal Processing, 1, 1-11.

Ando, S., & Yamaguchi, K. (1993). Statistical study of spectral parameters in musical instrument tones. Journal of the Acoustical Society of America, 94(1), 37-45.

Brown, J.C. (1999). Computer identification of musical instruments using pattern recognition with cepstral coefficients as features. Journal of the Acoustical Society of America, 105, 1933-1941.

Brown, J.C., Houix, O., & McAdams, S. (2001). Feature dependence in the automatic identification of musical woodwind instruments. Journal of the Acoustical Society of America, 109, 1064-1072.

Cosi, P., De Poli, G., & Lauzzana, G. (1994). Auditory modelling and self-organizing neural networks for timbre classification. Journal of New Music Research, 23, 71-98.

Eronen, A. (2001). Comparison of features for musical instrument recognition. Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics WASPAA 2001, New York, NY, USA.

Fritts, L. (1997). The University of Iowa musical instrument samples. Retrieved 2004 from http://theremin.music.uiowa.edu/MIS.html

Fujinaga, I., & McMillan, K. (2000). Realtime recognition of orchestral instruments. Proceedings of the International Computer Music Conference, Berlin, Germany.

Herrera, P., Amatriain, X., Batlle, E., & Serra, X. (2000). Towards instrument segmentation for music content description: A critical review of instrument classification techniques. Proceedings of the International Symposium on Music Information Retrieval ISMIR 2000, Plymouth, Massachusetts.

Hornbostel, E.M.V., & Sachs, C. (1914). Systematik der Musikinstrumente. Ein Versuch. Zeitschrift für Ethnologie, 46(4-5), 553-590.

IRCAM, Institute de Recherche et Coordination Acoustique/Musique. (2003). Studio on line. Retrieved 2004 from http://forumnet.ircam.fr/rubrique.php3?id_rubrique=107

ISO, International Organisation for Standardisation. (2003). MPEG-7 overview. Retrieved 2004 from http://www.chiariglione.org/mpeg/standards/mpeg-7/mpeg-7.htm

Kaminskyj, I. (2002). Multi-feature musical instrument sound classifier w/user determined generalisation performance. Proceedings of the Australasian Computer Music Association Conference ACMC 2002, Melbourne, Australia.

Kostek, B., & Czyzewski, A. (2001). Representing musical instrument sounds for their automatic classification. Journal of the Audio Engineering Society, 49(9), 768-785.

Martin, K.D., & Kim, Y.E. (1998). Musical instrument identification: A pattern-recognition approach. Proceedings of the 136th Meeting of the Acoustical Society of America, Norfolk, Virginia.

Opolko, F., & Wapnick, J. (1987). MUMS: McGill University master samples [CD-ROM]. McGill University, Montreal, Quebec, Canada.

Peeters, G., McAdams, S., & Herrera, P. (2000). Instrument sound description in the context of MPEG-7. Proceedings of the International Computer Music Conference ICMC2000, Berlin, Germany.

Pollard, H.F., & Jansson, E.V. (1982). A tristimulus method for the specification of musical timbre. Acustica, 51, 162-171.

SIL. (1999). LinguaLinks library. Retrieved 2004 from http://www.sil.org/LinguaLinks/Anthropology/ExpnddEthnmsclgyCtgrCltrlMtrls/MusicalInstrumentsSubcategorie.htm

Smith, R. (2000). Rod's encyclopedic dictionary of traditional music. Retrieved 2004 from http://www.sussexfolk.freeserve.co.uk/ency/a.htm

Viste, H., & Evangelista, G. (2003). Separation of harmonic instruments with overlapping partials in multi-channel mixtures. Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics WASPAA-03, New Paltz, New York.

Whitaker, J.C., & Benson, K.B. (Eds.). (2002). Standard handbook of audio and radio engineering. New York: McGraw-Hill.

Wieczorkowska, A. (1999a). Rough sets as a tool for audio signal classification. Foundations of Intelligent Systems, LNCS/LNAI 1609, 11th Symposium on Methodologies for Intelligent Systems, Proceedings/ISMIS99, Warsaw, Poland.

Wieczorkowska, A. (1999b). Skuteczność rozpoznawania dźwięków instrumentów muzycznych w zależności od sposobu parametryzacji i rodzaju klasyfikatora (Efficiency of musical instrument sound recognition depending on parameterization and classifier) [doctoral thesis, in Polish]. Gdansk: Technical University of Gdansk.

Wieczorkowska, A., Wróblewski, J., Synak, P., & Ślęzak, D. (2003). Application of temporal descriptors to musical instrument sound recognition. Journal of Intelligent Information Systems, 21(1), 71-93.

KEY TERMS

Aerophones: The category of musical instruments, called wind instruments, producing sound by the vibration of air. Woodwinds and brass (lip-vibrated) instruments belong to this category, including single-reed woodwinds (e.g., clarinet), double reeds (oboe), flutes, and brass (trumpet).

Articulation: The process by which sounds are formed; the manner in which notes are struck, sustained, and released. Examples: staccato (shortening and detaching of notes), legato (smooth), pizzicato (plucking strings), vibrato (varying the pitch of a note up and down), muted (stringed instruments: by sliding a block of rubber or similar material onto the bridge; brass: by inserting a conical device into the bell).

Automatic Classification: The process by which a classificatory system processes information in order to classify data accurately; also, the result of such a process.

Chordophones: The category of musical instruments producing sound by means of a vibrating string; stringed instruments. Examples: guitar, violin, piano, harp, lyre, musical bow.

Idiophones: The category of musical instruments made of solid, non-stretchable, sonorous material. Subcategories: idiophones struck together by concussion (e.g., castanets), struck (gong), rubbed (musical glasses), scraped (washboards), stamped (hard floors stamped with tap shoes), shaken (rattles), and plucked (jew's harp).

Membranophones: The category of musical instruments; skin drums. Examples: daraboukka, tambourine.

Parameterization: The assignment of parameters to represent processes that usually are not easily described by equations.

Sound: A physical disturbance in the medium through which it is propagated. This fluctuation may change periodically, and such periodic sound is perceived as having pitch. The audible frequency range is about 20-20,000 Hz (hertz, or cycles per second). A harmonic sound wave consists of frequencies that are integer multiples of the first component (fundamental frequency), which corresponds to pitch.

Spectrum: The distribution of the component frequencies of the sound, each being a sine wave, with their amplitudes and phases (time locations of these components). These frequencies can be determined through Fourier analysis.

88

TEAM LinG
89

Bayesian Networks B
Ahmad Bashir
University of Texas at Dallas, USA

Latifur Khan
University of Texas at Dallas, USA

Mamoun Awad
University of Texas at Dallas, USA

INTRODUCTION is that each feature Fi is conditionally independent of


every other feature Fj. This situation is mathematically
A Bayesian network is a graphical model that finds represented as:
probabilistic relationships among variables of a system.
The basic components of a Bayesian network include a set P(Fi | C, Fj) = P(Fi | C)
of nodes, each representing a unique variable in the
system, their inter-relations, as indicated graphically by Such nave models are easier to compute, because
edges, and associated probability values. By using these they factor into P(C) and a series of independent probabil-
probabilities, termed conditional probabilities, and their ity distributions. The Nave Bayes classifier combines
interrelations, we can reason and calculate unknown this model with a decision rule. The common rule is to
probabilities. Furthermore, Bayesian networks have dis- choose the label that is most probable, known as the
tinct advantages compared to other methods, such as maximum a posteriori or MAP decision rule.
neural networks, decision trees, and rule bases, which we In a supervised learning setting, one wants to estimate
shall discuss in this paper. the parameters of the probability model. Because of the
independent feature assumption, it suffices to estimate
the class prior and the conditional feature models inde-
BACKGROUND pendently by using the method of maximum likelihood.
The Nave Bayes classifier has several properties that
Bayesian classification is based on Nave Bayesian clas- make it simple and practical, although the independence
sifiers, which we discuss in this section. Naive Bayesian assumptions are often violated. The overall classifier is
classification is the popular name for a probabilistic the robust to serious deficiencies of its underlying nave
classification. The term Naive Bayes refers to the fact that probability model, and in general, the Nave Bayes ap-
the probability model can be derived by using Bayes proach is more powerful than might be expected from the
theorem and that it incorporates strong independence extreme simplicity of its model; however, in the presence
assumptions that often have no bearing in reality, hence of nonindependent attributes wi, the Nave Bayesian
they are deliberately nave. Depending on the model, classifier must be upgraded to the Bayesian classifier,
Nave Bayes classifiers can be trained very efficiently in which will more appropriately model the situation.
a supervised learning setting. In many practical applica-
tions, parameter estimation for Nave Bayes models uses
the method of maximum likelihood. MAIN THRUST
Abstractly, the desired probability model for a classi-
fier is a conditional model The basic concept in the Bayesian treatment of certainties
in causal networks is conditional probability. When the
P(C | F1,,Fn) probability of an event A, P(A), is known, then it is
conditioned by other known factors. A conditional prob-
over a dependent class variable C with a small number of ability statement has the following form:
outcomes, or classes, conditional on several feature vari-
ables F1 through Fn. The nave conditional independence Given that event B has occurred, the probability of
assumptions play a role at this stage, when the probabili- the event A occurring is x.
ties are being computed. In this model, the assumption

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Bayesian Networks

Graphical Models Bayesian Probabilities

A graphical model visually illustrates conditional inde- Probability calculus does not require that the probabili-
pendencies among variables in a given problem. Two ties be based on theoretical results or frequencies of
variables that are conditionally independent have no repeated experiments, commonly known as relative fre-
direct impact on each others values. Furthermore, the quencies. Probabilities may also be completely subjective
graphical model shows any intermediary variables that estimates of the certainty of an event.
separate two conditionally independent variables. Consider an example of a basketball game. If one were
Through these intermediary variables, two conditionally to bet on an upcoming game between Team A and Team
independent variables affect one another. B, it is important to know the probability of Team A
A graph is composed of a set of nodes, which repre- winning the game. This probability is definitely not a ratio,
sent variables, and a set of edges. Each edge connects two a relative frequency, or even an estimate of a relative
nodes, and an edge can have an optional direction as- frequency; the game cannot be repeated many times
signed to it. For X1 and X2, if a causal relationship between under exactly the same conditions. Rather, the probability
the variables exists, the edge will be directional, leading represents only ones belief concerning Team As chances
from the case variable to the effect variable; if just a of winning. Such a probability is termed a Bayesian or
correlation between the variables exists, the edge will be subjective probability and makes use of Bayes theorem
undirected. to calculate unknown probabilities.
We use an example with three variables to illustrate A Bayesian probability may also be referred to as a
these concepts. In this example, two conditionally inde- personal probability. The Bayesian probability of an
pendent variables, A and C, are directly related to another event x is a persons degree of belief in that event. A
variable, B. To represent this situation, an edge must exist Bayesian probability is a property of the person who
between the nodes of the variables that are directly assigns the probability, whereas a classical probability
related, that is, between A and B and between B and C. is a physical property of the world, meaning it is the
Furthermore, the relationships between A and B and B and physical probability of an event.
C are correlations as opposed to causal relations; hence, An important difference between physical probability
the respective edges will be undirected. Figure 1 illus- and Bayesian probability is that repeated trials are not
trates this example. Due to conditional independence, necessary to measure the Bayesian probability. The Baye-
nodes A and C still have an indirect influence on one sian method can assign a probability for events that may
another; however, variable B encodes the information be difficult to experimentally determine. An oft-voiced
from A that impacts C, and vice versa. criticism of the Bayesian approach is that probabilities
A Bayesian network is a specific type of graphical seem arbitrary, but this is a probability assessment issue
model, with directed edges and no cycles (Stephenson, that does not take away from the many possibilities that
2000). The edges in Bayesian networks are viewed as Bayesian probabilities provide.
causal connections, where each parent node causes an
effect on its children. Causal Influence
In addition, nodes in a Bayesian network contain a
conditional probability table, or CPT, which stores all Bayesian networks require an operational method for
probabilities that may be used to reason or make infer- identifying causal relationships in order for accurate
ences within the system. domain modeling. Hence, causal influence is defined in
the following manner: If the action of making variable X
Figure 1. Graphical model of two independent variables take some value sometimes changes the value taken by
A and C that are directly related to a third variable B variable Y, then X is assumed to be responsible for some-
times changing Ys value, and one may conclude that X is
a cause of Y. More formally, X is manipulated when we
force X to take some value, and we say X causes Y if some
B manipulation of X leads to a change in the probability
distribution of Y.
Furthermore, if manipulating X leads to a change in the
probability distribution of Y, then X obtaining a value by
any means whatsoever also leads to a change in the
probability distribution of Y. Hence, one can make the
A C
natural conclusion that causes and their effects are statis-

90

TEAM LinG
Bayesian Networks

tically correlated. However, note that variables can be Advantages


correlated in less direct ways; that is, one variable may not B
necessarily cause the other. Rather, some intermediaries A Bayesian network is a graphical model that finds
may be involved. probabilistic relationships among variables of the sys-
tem, but a number of models are available for data analy-
Bayesian Networks: A First Look sis, including rule bases, decision trees, and artificial
neural networks. Several techniques for data analysis
This section provides a detailed definition of Bayesian also exist, including classification, density estimation,
networks in an attempt to merge the probabilistic, logic, regression, and clustering. However, Bayesian networks
and graphical modeling ideas that we have presented thus have distinct advantages that we discuss here.
far. One of the biggest advantages of Bayesian networks
Causal relations, apart from the reasoning that they is that they have a bidirectional message passing archi-
model, also have a quantitative side, namely their strength. tecture. Learning from the evidence can be interpreted as
This is expressed by associating numbers to the links. Let unsupervised learning. Similarly, expectation of an ac-
the variable A be a parent of the variable B in a causal tion can be interpreted as unsupervised learning. Be-
network. Using probability calculus, it will be normal to let cause Bayesian networks pass data between nodes and
the conditional probability be the strength of the link see the expectations from the world model, they can be
between these variables. On the other hand, if the variable considered as bidirectional learning systems (Helsper &
C is also a parent of the variable B, then conditional Gaag, 2002). In addition to bidirectional message pass-
probabilities P(B|A) and P(B|C) do not provide any infor- ing, Bayesian networks have several important features,
mation on how the impacts from variable A and variable B such as the allowance of subjective a priori judgments,
interact. They may cooperate or counteract in various direct representation of causal dependence, and the
ways. Therefore, the specification of P(B|A,C) is required. ability to imitate the human thinking process.
If loops exist within the reasoning, the domain model Bayesian networks handle incomplete data sets with-
may contain feedback cycles; these cycles are difficult to out difficulty because they discover dependencies
model quantitatively. For such networks, no calculus cop- among all the variables. When one of the inputs is not
ing with feedback cycles has been developed. Therefore, observed, most other models will end up with an inaccu-
the network must be loop free. In fact, from a practical rate prediction because they do not calculate the corre-
perspective, it should be modeled as closely to a tree as lation between the input variables. Bayesian networks
possible. suggest a natural way to encode these dependencies.
For this reason, they are a natural choice for image
Evidential Reasoning classification or annotation, because the lack of a clas-
sification can also be viewed as missing data.
As stated previously, Bayesian networks accomplish such Considering the Bayesian statistical techniques,
an economy by pointing out, for each variable Xi, the Bayesian networks facilitate the combination of domain
conditional probabilities P(Xi | pai) where pai is the set of knowledge and data (Neapolitan, 2004). Prior or domain
parents (of Xi) that render Xi independent of all its other knowledge is crucially important if one performs a real-
parents. After giving this specification, the joint probabil- world analysis, particularly when data are inadequate or
ity distribution can be calculated by the product expensive. The encoding of causal prior knowledge is
straightforward because Bayesian networks have causal
semantics. Additionally, Bayesian networks encode the
P( x1,..., xn ) = P( xi | pai ) strength of causal relationships with probabilities. There-
i
fore, prior knowledge and data can be put together with
Using this product, all probabilistic queries can be well studied techniques from Bayesian statistics for both
found coherently with probability calculus. Passing evi- forward and backward reasoning with the Bayesian net-
dence up and down a Bayesian network is known as belief work.
propagation, and exact probability calculations is NP- Bayesian networks also ease many of the theoretical
hard, meaning that the calculation time to compute exact and computational difficulties of rule-based systems by
probabilities grows exponentially with the size of the utilizing graphical structures for representing and man-
Bayesian network. A number of algorithms are available aging probabilistic knowledge. Independencies can be
for probabilistic calculations in Bayesian networks dealt with explicitly; they can be articulated, encoded
(Huang, King, & Lyu, 2002), beginning with Pearls graphically, read off the network, and reasoned about,
message passing algorithm. and yet they forever remain robust to numerical impres-

91

TEAM LinG
Bayesian Networks

sion. Moreover, graphical representations uncover sev- Bn) that is associated with each variable A with
eral opportunities for efficient computation and serve as parents B1, B2 Bn
understandable logic diagrams.
Bayesian networks can simulate humanlike reason- Bayesian networks continue to play a vital role in
ing; this fact is not, however, due to any structural prediction and classification within data mining
similarities with the human brain. Rather, it is because of (Niedermeyer, 1998). They are a marriage between prob-
the resemblance between the ways Bayesian networks ability theory and graph theory, providing a natural tool
and humans reason. The resemblance is more psychologi- for dealing with two problems that occur throughout
cal than biological but nevertheless a true benefit. applied mathematics and engineering: uncertainty and
complexity. Also, Bayesian networks play an increasingly
Bayesian Inference important role in the design and analysis of machine
learning algorithms, serving as a promising way to ap-
Inference is the task of computing the probability of each proach present and future problems related to artificial
value of a node in a Bayesian network when other vari- intelligence and data mining (Choudhary, Rehg, Pavlovic,
ables values are known (Jensen, 1999). This concept is & Pentland, 2002; Doshi, Greenwald, & Clarke, 2002;
what makes Bayesian networks so powerful, as it allows Fenton, Cates, Forey, Marsh, Neil, & Tailor, 2003).
the user to apply knowledge toward forward or backward
reasoning. Suppose that a specific value for one or more
of the variables in the network has been observed. If one REFERENCES
variable has a definite value, or evidence, the probabili-
ties, or belief values, for the other variables need to be Choudhury, T., Rehg, J. M., Pavlovic, V., & Pentland, A.
revised, as this variable is not a defined value. This (2002). Boosting and structure learning in dynamic Baye-
calculation of the updated probabilities for system vari- sian networks for audio-visual speaker detection. Pro-
ables that are based on new evidence is precisely the ceedings of the International Conference on Pattern
definition of inference. Recognition (ICPR), Canada, III (pp. 789-794).
Doshi, P., Greenwald, L., & Clarke, J. (2002). Towards
effective structure learning for large Bayesian networks.
FUTURE TRENDS Proceedings of the AAAI Workshop on Probabilistic
Approaches in Search, Canada (pp. 16-22).
The future of Bayesian networks lies in determining new
ways to tackle the following issues of Bayesian inferencing Fenton, N., Cates, P., Forey, S., Marsh, W., Neil, M., &
and in building a Bayesian structure that accurately Tailor, M. (2003). Modelling risk in complex software
represents a particular system. As we discuss in this projects using Bayesian networks (Tech. Rep. No. RA-
paper, conditional dependencies can be mapped into a DAR Tech Repo). London: Queen Mary University.
graph in several ways, each with subtle semantic and
statistical differences. Future research will give way to Helsper, E.M. & Gaag, L.C. van der. (2002). Building
Bayesian networks that can understand system seman- Bayesian networks through ontologies. Proceedings of
tics and adapt accordingly, not only with respect to the the 15th Eureopean Conference on Artificial Intelli-
conditional probabilities within each node but also with gence, Lyon, France (pp. 680-684).
respect to the graph itself. Huang, K., King, I., & R. Lyu, M. (2002). Learning maximum
likelihood semi-naive Bayesian network classifier. Pro-
ceedings of the IEEE International Conference on Sys-
CONCLUSION tems, Man, and Cybernetics, 3, 6, Hamammet, Tunisia.

A Bayesian network consists of the following elements: Jensen, F. (1999). Gradient descent training of Bayesian
networks. Proceedings of the European Conference on
A set of variables and a set of directed edges Symbolic and Quantitative Approaches to Reasoning
between variables and Uncertainty (pp. 190-200).
A finite set of mutually exclusive states for each Neapolitan, R. E. (2004). Learning Bayesian networks.
variable Upper Saddle River, NJ: Prentice-Hall.
A directed acyclic graph (DAG), constructed from
the variables coupled with the directed edges Niedermayer, D. (1998). An introduction to Bayesian
A conditional probability table (CPT) P(A | B1, B2, , networks and their contemporary applications. Re-

92

TEAM LinG
Bayesian Networks

trieved October 2004, from http://www.niedermayer.ca/ Independent: Two random variables are independent
papers/bayesian/ when knowing something about the value of one of them B
does not yield any information about the value of the
Stephenson, T. (2000). An introduction to Bayesian net- other.
work theory and usage. Retrieved October 2004, from
http://www.idiap.ch/publications/todd00a.bib.abs.html Joint Probability: The probability of two events oc-
curring in conjunction.

KEY TERMS Maximum Likelihood: Method of point estimation


using as an estimate of an unobservable population
Bayes Theorem: Result in probability theory that parameter the member of the parameter space that maxi-
states the conditional probability of a variable A, given B, mizes the likelihood function.
in terms of the conditional probability of variable B, given Neural Networks: Learning systems that are designed
A, and the marginal probability of A alone. by analogy with a simplified model of the neural connec-
Conditional Probability: Probability of some event A, tions in the brain and can be trained to find nonlinear
assuming event B, written mathematically as P(A|B). relationships in data.

Data Mining: The application of analytical methods Supervised Learning: A machine learning technique
and tools to data for the purpose of identifying patterns for creating a function from training data; the task of the
and relationships such as classification, prediction, esti- supervised learner is to predict the value of the function
mation, or affinity grouping. for any valid input object after having seen only a small
number of training data.

93

TEAM LinG
94

Best Practices in Data Warehousing from the


Federal Perspective
Les Pang
National Defense University, USA

INTRODUCTION proach, an organization can gain significant competitive


advantages through the new level of corporate knowledge.
Data warehousing has been a successful approach for Various agencies in the Federal Government at-
supporting the important concept of knowledge man- tempted to implement a data warehousing strategy in
agement one of the keys to organizational success at order to achieve data interoperability. Many of these
the enterprise level. Based on successful implementa- agencies have achieved significant success in improving
tions of warehousing projects, a number of lessons internal decision processes as well as enhancing the
learned and best practices were derived from these delivery of products and services to the citizen. This
project experiences. The scope was limited to projects chapter aims to identify the best practices that were
funded and implemented by federal agencies, military implemented as part of the successful data warehousing
institutions and organizations directly supporting them. projects within the federal sector.
Projects and organizations reviewed include the fol-
lowing:
MAIN THRUST
Census 2000 Cost and Progress System
Defense Dental Standard System Each best practice (indicated in boldface) and its ratio-
Defense Medical Logistics Support System Data nale are listed below. Following each practice is a
Warehouse Program description of illustrative project or projects (indicated
Department of Agriculture Rural Development in italics), which support the practice.
Data Warehouse
DOD Computerized Executive Information Sys- Ensure the Accuracy of the Source
tem Data to Maintain the Users Trust of
Department of Transportation (DOT) Executive
Reporting Framework System
the Information in a Warehouse
Environmental Protection Agency (EPA)
Envirofacts Warehouse The user of a data warehouse needs to be confident that
Federal Credit Union the data in a data warehouse is timely, precise, and
Health and Urban Development (HUD) Enterprise complete. Otherwise, a user that discovers suspect data
Data Warehouse in warehouse will likely cease using it, thereby reducing
Internal Revenue Service (IRS) Compliance Data the return on investment involved in building the ware-
Warehouse house. Within government circles, the appearance of
U.S. Army Operational Testing and Evaluation Com- suspect data takes on a new perspective.
mand HUD Enterprise Data Warehouse - Gloria Parker,
U.S. Coast Guard Executive Information System HUD Chief Information Officer, spearheaded data ware-
U.S. Navy Type Commanders Readiness Manage- housing projects at the Department of Education and at
ment System HUD. The HUD warehouse effort was used to profile
performance, detect fraud, profile customers, and do
what if analysis. Business areas served include Fed-
eral Housing Administration loans, subsidized proper-
BACKGROUND ties, and grants. She emphasizes that the public trust of
the information is critical. Government agencies do not
Data warehousing involves the consolidation of data want to jeopardize our public trust by putting out bad
from various transactional data sources in order to data. Bad data will result in major ramifications not only
support the strategic needs of an organization. This from citizens but also from the government auditing
approach links the various silos of data that is distrib- arm, the General Accounting Office, and from Congress
uted throughout an organization. By applying this ap- (Parker, 1999).
Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Best Practices in Data Warehousing from the Federal Perspective

EPA Envirofacts Warehouse - The Envirofacts data their dependents. Over 12,000 doctors, nurses and ad-
warehouse comprises of information from 12 different ministrators use it. Frank Gillett, an analyst at Forrester B
environmental databases for facility information, in- Research, Inc., stated that, What kills these huge data
cluding toxic chemical releases, water discharge permit warehouse projects is that the human beings dont agree
compliance, hazardous waste handling processes, on the definition of data. Without that . . . all that $450
Superfund status, and air emission estimates. Each pro- million [cost of the warehouse project] could be thrown
gram office provides its own data and is responsible for out the window (Hamblen, 1998).
maintaining this data. Initially, the Envirofacts ware-
house architects noted some data integrity problems, Be Selective on what Data Elements to
namely, issues with accurate data, understandable data, Include in the Warehouse
properly linked data and standardized data. The archi-
tects had to work hard to address these key data issues so Users are unsure of what they want so they place an
that the public can trust that the quality of data in the excessive number of data elements in the warehouse.
warehouse (Garvey, 2003). This results in an immense, unwieldy warehouse in
U.S. Navy Type Commander Readiness Manage- which query performance is impaired.
ment System The Navy uses a data warehouse to Federal Credit Union - The data warehouse archi-
support the decisions of its commanding officers. Data tect for this organization suggests that users know which
at the lower unit levels is aggregated to the higher levels data they use most, although they will not always admit
and then interfaced with other military systems for a to what they use least (Deitch, 2000).
joint military assessment of readiness as required by the
Joint Chiefs of Staff. The Navy found that it was spend-
ing too much time to determine its readiness and some
Select the Extraction-Transformation-
of its reports contained incorrect data. The Navy devel- Loading (ETL) Strategy Carefully
oped a user friendly, Web-based system that provides
quick and accurate assessment of readiness data at all Having an effective ETL strategy that extracts data from
levels within the Navy. The system collects, stores, the various transactional systems, transforms the data to
reports and analyzes mission readiness data from air, sub a common format, and loads the data into a relational or
and surface forces for the Atlantic and Pacific Fleets. multidimensional database is the key to a successful
Although this effort was successful, the Navy learned that data warehouse project. If the ETL strategy is not effec-
data originating from the lower levels still needs to be tive, it will mean delays in refreshing the data ware-
accurate. The reason is that a number of legacy systems, house, contaminating the data warehouse with dirty data,
which serves as the source data for the warehouse, lacked and increasing the costs in maintaining the warehouse.
validation functions (Microsoft, 2000). IRS Compliance Warehouse supports research and
decision support, allows the IRS to analyze, develop,
Standardize the Organizations Data and implement business strategies for increasing volun-
tary compliance, improving productivity and managing
Definitions the organization. It also provides projections, forecasts,
quantitative analysis, and modeling. Users are able to
A key attribute of a data warehouse is that it serves as a query this data for decision support.
single version of the truth. This is a significant im- A major hurdle was to transform the large and di-
provement over the different and often conflicting ver- verse legacy online transactional data sets for effective
sions of the truth that come from an environment of use in an analytical architecture. They needed a way to
disparate silos of data. To achieve this singular version process custom hierarchical data files and convert to
of the truth, there needs to be consistent definitions of ASCII for local processing and mapping to relational
data elements to afford the consolidation of common databases. They ended up with developing a script pro-
information across different data sources. These con- gram that will do all of this. ETL is a major challenge and
sistent data definitions are captured in a data warehouses may be a showstopper for a warehouse implementa-
metadata repository. tion (Kmonk, 1999).
DoD Computerized Executive Information System
(CEIS) is a 4-terabyte data warehouse holds the medical
records of the 8.5 million active members of the U.S.
Leverage the Data Warehouse to
military health care system who are treated at 115 Provide Auditing Capability
hospitals and 461 clinics around the world. The Defense
Department wanted to convert its fixed-cost health care An overlooked benefit of data warehouses is its capabil-
system to a managed-care model to lower costs and ity of serving as an archive of historic knowledge that
increase patient care for the active military, retirees and can be used as an audit trail for later investigations.
95

TEAM LinG
Best Practices in Data Warehousing from the Federal Perspective

U.S. Army Operational Testing and Evaluation Com- found that portals provide an effective way to access
mand (OPTEC) is charged with developing test criteria diverse data sources via a single screen (Kmonk, 1999).
and evaluating the performance of extremely complex weap-
ons equipment in every conceivable environment and Make Warehouse Data Available to All
condition. Moreover, as national defense policy is under- Knowledgeworkers (Not Only to
going a transformation, so do the weapon systems, and
thus the testing requirements. The objective of their ware-
Managers)
house was to consolidate a myriad of test data sets to
provide analysts and auditors with access to the specific The early data warehouses were designed to support
information needed to make proper decisions. upper management decision-making. However, over
OPTEC was having fits when audit agencies, such as time, organizations have realized the importance of
the General Accounting Office (GAO), would show up to knowledge sharing and collaboration and its relevance
investigate a weapon system. For instance, if problems to the success of the organizational mission. As a
with a weapon show up five years after it is introduced result, upper management has become aware of the
into the field, people are going to want to know what tests need to disseminate the functionality of the data ware-
were performed and the results of those tests. A ware- house throughout the organization.
house with its metadata capability made data retrieval IRS Compliance Data Warehouse supports a diver-
much more efficient (Microsoft, 2000). sity of user types economists, research analysts, and
statisticians all of whom are searching for ways to
improve customer service, increase compliance with
Leverage the Web and Web Portals for federal tax laws and increase productivity. It is not just
Warehouse Data to Reach Dispersed for upper management decision making anymore
Users (Kmonk, 1999).

In many organizations, users are geographically distrib- Supply Data in a Format Readable by
uted and the World Wide Web has been very effective as Spreadsheets
a gateway for these dispersed users to access the key
resources of their organization, which include data ware- Although online analytical tools such as those sup-
houses and data marts. ported by Cognos and Business Objects are useful for
U.S. Army OPTEC developed a Web-based front end data analysis, the spreadsheet is still the basic tool used
for its warehouse so that information can be entered and by most analysts.
accessed regardless of the hardware available to users. It U.S. Army OPTEC wanted users to transfer data and
supports the geographically dispersed nature of OPTECs work with information on applications that they are
mission. Users performing tests in the field can be familiar with. In OPTEC, they transfer the data into a
anywhere from Albany, New York to Fort Hood, Texas. format readable by spreadsheets so that analysts can
That is why the browser client the Army developed is so really crunch the data. Specifically, pivot tables found in
important to the success of the warehouse (Microsoft, spreadsheets allows the analysts to manipulate the infor-
2000). mation to put meaning behind the data (Microsoft, 2000).
DoD Defense Dental Standard System supports more
than 10,000 users at 600 military installations world-
wide. The solution consists of three main modules:
Restrict or Encrypt Classified/
Dental Charting, Dental Laboratory Management, and Sensitive Data
Workload and Dental Readiness Reporting. The charting
module helps dentists graphically record patient infor- Depending on requirements, a data warehouse can con-
mation. The lab module automates the workflow between tain confidential information that should not be re-
dentists and lab technicians. The reporting module al- vealed to unauthorized users. If privacy is breached, the
lows users to see key information though Web-based organization may become legally liable for damages
online reports, which is a key to the success of the and suffer a negative reputation with the ensuing loss of
defense dental operations. customers trust and confidence. Financial conse-
IRS Compliance Data Warehouse includes a Web- quences can result.
based query and reporting solution that provides high-value, DoD Computerized Executive Information System
easy-to-use data access and analysis capabilities, be quickly uses an online analytical processing tool from a popu-
and easily installed and managed, and scale to support lar vendor that could be used to restrict access to
hundreds of thousands of users. With this portal, the IRS certain data, such as HIV test results, so that any confi-
dential data would not be disclosed (Hamblen, 1998).

96

TEAM LinG
Best Practices in Data Warehousing from the Federal Perspective

Considerations must be made in the architecting of a data Use Information Visualization


warehouse. One alternative is to use a roles-based archi- Techniques Such as Geographic B
tecture that allows access to sensitive data by only
authorized users and the encryption of data in the event
Information Systems (GIS)
of data interception.
A GIS combines layers of data about a physical location
to give users a better understanding of that location. GIS
Perform Analysis During the Data allows users to view, understand, question, interpret,
Collection Process and visualize data in ways simply not possible in para-
graphs of text or in the rows and columns of a spread-
Most data analyses involve completing data collection sheet or table.
before analysis can begin. With data warehouses, a new EPA Envirofacts Warehouse includes the capability
approach can be undertaken. of displaying its output via the EnviroMapper GIS sys-
Census 2000 Cost and Progress System was built to tem. It maps several types of environmental informa-
consolidate information from several computer sys- tion, including drinking water, toxic and air releases,
tems. The data warehouse allowed users to perform hazardous waste, water discharge permits, and Superfund
analyses during the data collection process; something sites at the national, state, and county levels (Garvey,
was previously not possible. The system allowed execu- 2003). Individuals familiar with Mapquest and other
tives to take a more proactive management role. With online mapping tools can easily navigate the system and
this system, Census directors, regional offices, manag- quickly get the information they need.
ers, and congressional oversight committees have the
ability to track the 2000 census, which never been done
before (SAS, 2000). FUTURE TRENDS
Leverage User Familiarity with Data warehousing will continue to grow as long as there
Browsers to Reduce Training are disparate silos of data sources throughout an organi-
Requirements zation. However, the irony is that there will be a prolif-
eration of data warehouses as well as data marts, which
The interface of a Web browser is very familiar to most will not interoperate within an organization. Some ex-
employees. Navigating through a learning management perts predict the evolution toward a federated architec-
system using a browser may be more user friendly than ture for the data warehousing environment. For ex-
using the navigation system of a proprietary training ample, there will be a common staging area for data
software. integration and, from this source, data will flow among
U.S. Department of Agriculture (USDA) Rural De- several data warehouses. This will ensure that the single
velopment, Office of Community Development, admin- truth requirement is maintained throughout the organi-
isters funding programs for the Rural Empowerment zation (Hackney, 2000).
Zone Initiative. There was a need to tap into legacy Another important trend in warehousing is one away
databases to provide accurate and timely rural funding from historic nature of data in warehouses and toward
information to top policy makers. Through Web acces- real-time distribution of data so that information vis-
sibility using an intranet system, there were dramatic ibility will be instantaneous (Carter, 2004). This is a key
improvements in financial reporting accuracy and timely factor for business decision-making in a constantly
access to data. Prior to the intranet, questions such as changing environment. Emerging technologies, namely
What were the Rural Development investments in 1997 service-oriented architectures and Web services, are
for the Mississippi Delta region? required weeks of expected to be the catalyst for this to occur.
laborious data gathering and analysis, yet yielded obso-
lete answers with only an 80 percent accuracy factor.
Now, similar analysis takes only a few minutes to per- CONCLUSION
form, and the accuracy of the data is as high as 98
percent. More than 7,000 Rural Development employ- An organization needs to understand how it can leverage
ees nationwide can retrieve the information at their data from a warehouse or mart to improve its level of
desktops, using a standard Web browser. Because em- service and the quality of its products and services.
ployees are familiar with the browser, they did not need Also, the organization needs to recognize that its most
training to use the new data mining system (Ferris, valuable resource, the workforce, needs to be adequately
2003). trained in accessing and utilizing a data warehouse. The

97

TEAM LinG
Best Practices in Data Warehousing from the Federal Perspective

workforce should recognize the value of the knowledge Parker, G. (1999). Data warehousing at the federal govern-
that can be gained from data warehousing and how to ment: A CIO perspective. In Proceedings from Data
apply it to achieve organizational success. Warehouse Conference 99.
A data warehouse should be part of an enterprise
architecture, which is a framework for visualizing the PriceWaterhouseCoopers. (2001). Technology forecast.
information technology assets of an enterprise and how SAS. (2000). The U.S. Bureau of the Census counts on a
these assets interrelate. It should reflect the vision and better system.
business processes of an organization. It should also
include standards for the assets and interoperability Schwartz, A. (2000). Making the Web Safe. Federal Com-
requirements among these assets. puter Week.

REFERENCES KEY TERMS


AMS. (1999). Military marches toward next-genera- ASCII: American Standard Code for Information
tion health care service: The Defense Dental Stan- Interchange. Serves a code for representing English
dard System. characters as numbers with each letter assigned a num-
Carter, M. (2004). The death of data warehousing. Loosely ber from 0 to 127.
Coupled. Data Warehousing: A compilation of data designed
Deitch, J. (2000). Technicians are from Mars, users are to for decision support by executives, managers, ana-
from Venus: Myths and facts about data warehouse lysts and other key stakeholders in an organization. A
administration (Presentation). data warehouse contains a consistent picture of busi-
ness conditions at a single point in time.
Ferris, N. (1999). 9 hot trends for 99. Government Execu-
tive. Database: A collection of facts, figures, and objects
that is structured so that it can easily be accessed,
Ferris, N. (2003). Information is power. Government Ex- organized, managed, and updated.
ecutive.
Enterprise Architecture: A business and perfor-
Garvey, P. (2003). Envirofacts warehouse public ac- mance-based framework to support cross-agency col-
cess to environmental data over the Web (Presenta- laboration, transformation, and organization-wide im-
tion). provement.
Gerber, C. (1996). Feds turn to OLAP as reporting tool. Extraction-Transformation-Loading (ETL): A
Federal Computer Week. key transitional set of steps in migrating data from the
Hackney, D. (2000). Data warehouse delivery: The fed- source systems to the database housing the data ware-
erated future. DM Review. house. Extraction refers to drawing out the data from
the source system, transformation concerns converting
Hamblen, M. (1998). Pentagon to deploy huge medical the data to the format of the warehouse and loading
data warehouse. Computer World. involves storing the data into the warehouse.
Kirwin, B. (2003). Management update: Total cost of Geographic Information Systems: Map-based
ownership analysis provides many benefits. Gartner tools used to gather, transform, manipulate, analyze, and
Research, IGG-08272003-01. produce information related to the surface of the Earth.
Kmonk, J. (1999). Viador information portal provides Web Hierarchical Data Files: Database systems that
data access and reporting for the IRS. DM Review. are organized in the shape of a pyramid with each row of
objects linked to objects directly beneath it. This
Matthews, W. (2000). Digging digital gold. Federal Com- approach has generally been superceded by relationship
puter Week. database systems.
Microsoft Corporation. (2000). OPTEC adopts data ware- Knowledge Management: A concept where an organi-
housing strategy to test critical weapons systems. zation deliberately and comprehensively gathers, orga-
Microsoft Corporation. (2000). U.S. Navy ensures readi- nizes, and analyzes its knowledge, then shares it inter-
ness using SQL Server. nally and sometimes externally.

98

TEAM LinG
Best Practices in Data Warehousing from the Federal Perspective

Legacy System: Typically, a database management Terabyte: A unit of memory or data storage capacity
system in which an organization has invested consider- equal to roughly 1,000 gigabytes. B
able time and money and resides on a mainframe or
minicomputer. Total Cost of Ownership: Developed by Gartner
Group, an accounting method used by organizations
Outsourcing: Acquiring services or products from seeking to identify their both direct and indirect sys-
an outside supplier or manufacturer in order to cut costs tems costs.
and/or procure outside expertise.
Performance Metrics: Key measurements of sys-
tem attributes that is used to determine the success of NOTE
the process.
Pivot Tables: An interactive table found in most The views expressed in this article are those of the
spreadsheet programs that quickly combines and com- author and do not reflect the official policy or position
pares typically large amounts of data. One can rotate its of the National Defense University, the Department of
rows and columns to see different arrangements of the Defense or the U.S. Government.
source data, and also display the details for areas of
interest.

99

TEAM LinG
100

Bibliomining for Library Decision-Making


Scott Nicholson
Syracuse University School of Information Studies, USA

Jeffrey Stanton
Syracuse University School of Information Studies, USA

INTRODUCTION BACKGROUND

Most people think of a library as the little brick building Forward-thinking authors in the field of library science
in the heart of their community or the big brick building in began to explore sophisticated uses of library data some
the center of a college campus. However, these notions years before the concept of data mining became popular-
greatly oversimplify the world of libraries. Most large ized. Nutter (1987) explored library data sources to sup-
commercial organizations have dedicated in-house li- port decision making but lamented that the ability to
brary operations, as do schools; nongovernmental orga- collect, organize, and manipulate data far outstrips the
nizations; and local, state, and federal governments. With ability to interpret and to apply them (p. 143). Johnston
the increasing use of the World Wide Web, digital librar- and Weckert (1990) developed a data-driven expert sys-
ies have burgeoned, serving a huge variety of different tem to help select library materials, and Vizine-Goetz,
user audiences. With this expanded view of libraries, two Weibel, and Oskins (1990) developed a system for auto-
key insights arise. First, libraries are typically embedded mated cataloging based on book titles (see also Morris,
within larger institutions. Corporate libraries serve their 1992, and Aluri & Riggs, 1990). A special section of
corporations, academic libraries serve their universities, Library Administration and Management, Mining your
and public libraries serve taxpaying communities who automated system, included articles on extracting data
elect overseeing representatives. Second, libraries play a to support system management decisions (Mancini, 1996),
pivotal role within their institutions as repositories and extracting frequencies to assist in collection decision
providers of information resources. In the provider role, making (Atkins, 1996), and examining transaction logs to
libraries represent in microcosm the intellectual and learn- support collection management (Peters, 1996).
ing activities of the people who comprise the institution. More recently, Banerjeree (1998) focused on describ-
This fact provides the basis for the strategic importance ing how data mining works and how to use it to provide
of library data mining: By ascertaining what users are better access to the collection. Guenther (2000) discussed
seeking, bibliomining can reveal insights that have mean- data sources and bibliomining applications but focused
ing in the context of the librarys host institution. on the problems with heterogeneous data formats.
Use of data mining to examine library data might be Doszkocs (2000) discussed the potential for applying
aptly termed bibliomining. With widespread adoption of neural networks to library data to uncover possible asso-
computerized catalogs and search facilities over the past ciations between documents, indexing terms, classifica-
quarter century, library and information scientists have tion codes, and queries. Liddy (2000) combined natural
often used bibliometric methods (e.g., the discovery of language processing with text mining to discover informa-
patterns in authorship and citation within a field) to tion in digital library collections. Lawrence, Giles, and
explore patterns in bibliographic information. During the Bollacker (1999) created a system to retrieve and index
same period, various researchers have developed and citations from works in digital libraries. Gutwin, Paynter,
tested data-mining techniques, which are advanced sta- Witten, Nevill-Manning, and Frank (1999) used text min-
tistical and visualization methods to locate nontrivial ing to support resource discovery.
patterns in large datasets. Bibliomining refers to the use These projects all shared a common focus on improv-
of these bibliometric and data-mining techniques to ex- ing and automating two of the core functions of a library:
plore the enormous quantities of data generated by the acquisitions and collection management. A few authors
typical automated library. have recently begun to address the need to support
management by focusing on understanding library users:

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Bibliomining for Library Decision-Making

Schulman (1998) discussed using data mining to examine ILS Data Sources from the Creation of
changing trends in library user behavior; Sallis, Hill, the Library System B
Jancee, Lovette, and Masi (1999) created a neural network
that clusters digital library users; and Chau (2000) dis-
cussed the application of Web mining to personalize
Bibliographic Information
services in electronic reference.
The December 2003 issue of Information Technology One source of data is the collection of bibliographic
and Libraries was a special issue dedicated to the records and searching interfaces that represents materi-
bibliomining process. Nicholson presented an overview als in the library, commonly known as the Online Public
of the process, including the importance of creating a data Access Catalog (OPAC). In a digital library environment,
warehouse that protects the privacy of users. Zucca the same type of information collected in a bibliographic
discussed the implementation of a data warehouse in an library record can be collected as metadata. The concepts
academic library. Wormell; Surez-Balseiro, Iribarren- parallel those in a traditional library: Take an agreed-upon
Maestro, & Casado; and Geyer-Schultz, Neumann, & standard for describing an object, apply it to every object,
Thede used bibliomining in different ways to understand and make the resulting data searchable. Therefore, digital
the use of academic library sources and to create appro- libraries use conceptually similar bibliographic data
priate library services. sources to traditional libraries.
We extend these efforts by taking a more global view
of the data generated in libraries and the variety of Acquisitions Information
decisions that those data can inform. Thus, the focus of
this work is on describing ways in which library and Another source of data for bibliomining comes from
information managers can use data mining to understand acquisitions, where items are ordered from suppliers and
patterns of behavior among library users and staff and tracked until they are received and processed. Because
patterns of information resource use throughout the insti- digital libraries do not order physical goods, somewhat
tution. different acquisition methods and vendor relationships
exist. Nonetheless, in both traditional and digital library
environments, acquisition data have untapped potential
MAIN THRUST for understanding, controlling, and forecasting informa-
tion resource costs.
Integrated Library Systems and Data
ILS Data Sources from Usage of the
Warehouses
Library System
Most managers who wish to explore bibliomining will
need to work with the technical staff of their Integrated User Information
Library System (ILS) vendors to gain access to the data-
bases that underlie the system and create a data ware- In order to verify the identity of users who wish to use
house. The cleaning, preprocessing, and anonymizing of library services, libraries maintain user databases. In
the data can absorb a significant amount of time and effort. libraries associated with institutions, the user database is
Only by combining and linking different data sources, closely aligned with the organizational database. Sophis-
however, can managers uncover the hidden patterns that ticated public libraries link user records through zip codes
can help them understand library operations and users. with demographic information in order to learn more about
their user population. Digital libraries may or may not have
Exploration of Data Sources any information about their users, based upon the login
procedure required. No matter what data are captured
Available library data sources are divided into three about the patron, it is important to ensure that the iden-
groups for this discussion: data from the creation of the tification information about the patron is separated from
library, data from the use of the collection, and data from the demographic information before this information is
external sources not normally included in the ILS. stored in a data warehouse; doing so protects the privacy
of the individual.

101

TEAM LinG
Bibliomining for Library Decision-Making

Circulation and Usage Information fore, tracking in-house use is also vital in discovering
patterns of use. This task becomes much easier in a
The richest sources of information about library user digital library, as Web logs can be analyzed to discover
behavior are circulation and usage records. Legal and what sources the users examined.
ethical issues limit the use of circulation data, however. A
data warehouse can be useful in this situation, because Interlibrary Loan and Other Outsourcing
basic demographic information and details about the cir- Services
culation could be recorded without infringing upon the
privacy of the individual. Many libraries use interlibrary loan and/or other
Digital library services have a greater difficulty in outsourcing methods to get items on a need-by-need
defining circulation, as viewing a page does not carry the basis for users. The data produced by this class of
same meaning as checking a book out of the library, transactions will vary by service but can provide a
although requests to print or save a full text information window to areas of need in a library collection.
resource might be similar in meaning. Some electronic full-
text services already implement the server-side capture of Applications of Bibliomining through a
such requests from their user interfaces.
Data Warehouse
Searching and Navigation Information Bibliomining can provide an understanding of the indi-
vidual sources listed previously in this article; however,
The OPAC serves as the primary means of searching for much more information can be discovered when sources
works owned by the library. Additionally, because most are combined through common fields in a data ware-
OPACs use a Web browser interface, users may also house.
access bibliographic databases, the World Wide Web,
and other online resources during the same session; all
this information can be useful in library decision making.
Bibliomining to Improve Library Services
Digital libraries typically capture logs from users who are
searching their databases and can track, through Most libraries exist to serve the information needs of
clickstream analysis, the elements of Web-based services users, and therefore, understanding the needs of indi-
visited by users. In addition, the combination of a login viduals or groups is crucial to a librarys success. For
procedure and cookies allows the connection of user many decades, librarians have suggested works; market
demographics to the services and searches they used in a basket analysis can provide the same function through
session. usage data in order to aid users in locating useful works.
Bibliomining can also be used to determine areas of
deficiency and to predict future user needs. Common
External Data Sources areas of item requests and unsuccessful searches may
point to areas of collection weakness. By looking for
Reference Desk Interactions patterns in high-use items, librarians can better predict
the demand for new items.
In the typical face-to-face or telephone interaction with a Virtual reference desk services can build a database
library user, the reference librarian records very little infor- of questions and expert-created answers, which can be
mation about the interaction. Digital reference transac- used in a number of ways. Data mining could be used to
tions, however, occur through an electronic format, and discover patterns for tools that will automatically assign
the transaction text can be captured for later analysis, questions to experts based upon past assignments. In
which provides a much richer record than is available in addition, by mining the question/answer pairs for pat-
traditional reference work. The utility of these data can be terns, an expert system could be created that can provide
increased if identifying information about the user can be users an immediate answer and a pointer to an expert for
captured as well, but again, anonymization of these trans- more information.
actions is a significant challenge.
Bibliomining for Organizational Decision
Item Use Information Making Within the Library
Fussler and Simon (as cited in Nutter, 1987) estimated that Just as the user behavior is captured within the ILS, the
75 to 80% of the use of materials in academic libraries is in behavior of library staff can also be discovered by con-
house. Some types of materials never circulate, and there-

102

TEAM LinG
Bibliomining for Library Decision-Making

necting various databases to supplement existing perfor- FUTURE TRENDS


mance review methods. Although monitoring staff through B
their performance may be an uncomfortable concept, Consortial Data Warehouses
tighter budgets and demands for justification require
thoughtful and careful performance tracking. In addition, One future path of bibliomining is to combine the data
research has shown that incorporating clear, objective from multiple libraries through shared data warehouses.
measures into performance evaluations can actually im- This merger will require standards if the libraries use
prove the fairness and effectiveness of those evaluations different systems. One such standard is the COUNTER
(Stanton, 2000). project (2004), which is a standard for reporting the use of
Low-use statistics for a work may indicate a problem digital library resources. Libraries working together to
in the selection or cataloging process. Looking at the pool their data will be able to gain a competitive advantage
associations between assigned subject headings, call over publishers and have the data needed to make better
numbers, and keywords, along with the responsible party decisions. This type of data warehouse can power evi-
for the catalog record, may lead to a discovery of system dence-based librarianship, another growing area of re-
inefficiencies. Vendor selection and price can be exam- search (Eldredge, 2000).
ined in a similar fashion to discover if a staff member Combining these data sources will allow library sci-
consistently uses a more expensive vendor when cheaper ence research to move from making statements about a
alternatives are available. Most libraries acquire works particular library to making generalizations about
both by individual orders and through automated order- librarianship. These generalizations can then be tested on
ing plans that are configured to fit the size and type of that other consortial data warehouses and in different settings
library. Although these automated plans do simplify the and may be the inspiration for theories. Bibliomining and
selection process, if some or many of the works they other forms of evidence-based librarianship can therefore
recommend go unused, then the plan might not be cost encourage the expansion of the conceptual and theoreti-
effective. Therefore, merging the acquisitions and circu- cal frameworks supporting the science of librarianship.
lation databases and seeking patterns that predict low use
can aid in appropriate selection of vendors and plans.
Bibliomining, Web Mining, and Text
Bibliomining for External Reporting and Mining
Justification Web mining is the exploration of patterns in the use of
Web pages. Bibliomining uses Web mining as its base but
The library may often be able to offer insights to their adds some knowledge about the user. This aids in one of
parent organization or community about their user base the shortcomings of Web mining many times, nothing
through patterns detected with bibliomining. In addition, is known about the user. This lack still holds true in some
library managers are often called upon to justify the digital library applications; however, when users access
funding for their library when budgets are tight. Likewise, password-protected areas, the library has the ability to
managers must sometimes defend their policies, particu- map some information about the patron onto the usage
larly when faced with user complaints. Bibliomining can information. Therefore, bibliomining uses tools from Web
provide the data-based justification to back up the anec- usage mining but has more data available for pattern
dotal evidence usually used for such arguments. discovery.
Bibliomining of circulation data can provide a number Text mining is the exploration of the context of text in
of insights about the groups who use the library. By order to extract information and understand patterns. It
clustering the users by materials circulated and tying helps to add information to the usage patterns discovered
demographic information into each cluster, the library can through bibliomining. To use terms from information
develop conceptual user groups that provide a model of science, bibliomining focuses on patterns in the data that
the important constituencies of the institutions user label and point to the information container, while text
base; this grouping, in turn, can fulfill some common mining focuses on the information within that container.
organizational needs for understanding where common In the future, organizations that fund digital libraries can
interests and expertise reside in the user community. This look to text mining to greatly improve access to materials
capability may be particularly valuable within large orga- beyond the current cataloging/metadata solutions.
nizations where research and development efforts are The quality and speed of text mining continues to
dispersed over multiple locations. improve. Liddy (2000) has researched the extraction of

103

TEAM LinG
Bibliomining for Library Decision-Making

information from digital texts; implementing these tech- REFERENCES


nologies can allow a digital library to move from suggest-
ing texts that might contain the answer to just providing Atkins, S. (1996). Mining automated systems for collec-
the answer by extracting it from the appropriate text or tion management. Library Administration & Manage-
texts. The use of such tools risks taking textual material ment, 10(1), 16-19.
out of context and also provides few hints about the
quality of the material, but if these extractions were links Banerjee, K. (1998). Is data mining right for your library?
directly into the texts, then context could emerge along Computer in Libraries, 18(10), 28-31.
with an answer. This situation could provide a substantial Chau, M. Y. (2000). Mediating off-site electronic reference
asset to organizations that maintain large bodies of tech- services: Human-computer interactions between libraries
nical texts, because it would promote rapid, universal and Web mining technology. IEEE Fourth International
access to previously scattered and/or uncataloged mate- Conference on Knowledge-Based Intelligent Engineer-
rials. ing Systems & Allied Technologies, USA, 2 (pp. 695-699).

Example of Hybrid Approach Chaudhry, A. S. (1993). Automation systems as tools of


use studies and management information. IFLA Journal,
Hwang and Chuang (in press) have recently combined 19(4), 397-409.
bibliomining, Web mining, and text mining in a COUNTER (2004). COUNTER: Counting online usage of
recommender system for an academic library. They started networked electronic resources. Retrieved from http://
by using data mining on Web usage data for articles in a www.projectcounter.org/about.html
digital library and combining that information with infor-
mation about the users. They then built a system that Doszkocs, T. E. (2000). Neural networks in libraries: The
looked at patterns between works based on their content potential of a new information technology. Retrieved
by using text mining. By combing these two systems into from http://web.simmons.edu/~chen/nit/NIT%2791/
a hybrid system, they found that the hybrid system 027~dos.htm
provides more accurate recommendations for users than
either system taken separately. This example is a perfect Eldredge, J. (2000). Evidence-based librarianship: An
representation of the future of bibliomining and how it can overview. Bulletin of the Medical Library Association,
be used to enhance the text-mining research projects 88(4), 289-302.
already in progress. Geyer-Schulz, A., Neumann, A., & Thede, A. (2003). An
architecture for behavior-based library recommender sys-
tems. Information Technology and Libraries, 22(4), 165-
CONCLUSION 174.

Libraries have gathered data about their collections and Guenther, K. (2000). Applying data mining principles to
users for years but have not always used those data for library data collection. Computers in Libraries, 20(4), 60-
better decision making. By taking a more active approach 63.
based on applications of data mining, data visualization, Gutwin, C., Paynter, G., Witten, I., Nevill-Manning, C., &
and statistics, these information organizations can get a Frank, E. (1999). Improving browsing in digital libraries
clearer picture of their information delivery and manage- with keyphrase indexes. Decision Support Systems, 2I,
ment needs. At the same time, libraries must continue to 81-104.
protect their users and employees from the misuse of
personally identifiable data records. Information discov- Hwang, S., & Chuang, S. (in press). Combining article
ered through the application of bibliomining techniques content and Web usage for literature recommendation in
gives the library the potential to save money, provide digital libraries. Online Information Review.
more appropriate programs, meet more of the users infor- Johnston, M., & Weckert, J. (1990). Selection advisor: An
mation needs, become aware of the gaps and strengths of expert system for collection development. Information
their collection, and serve as a more effective information Technology and Libraries, 9(3), 219-225.
source for its users. Bibliomining can provide the data-
based justifications for the difficult decisions and fund- Lawrence, S., Giles, C. L., & Bollacker, K. (1999). Digital
ing requests library managers must make. libraries and autonomous citation indexing. IEEE Com-
puter, 32(6), 67-71.

104

TEAM LinG
Bibliomining for Library Decision-Making

Liddy, L. (2000, November/December). Text mining. Bul- KEY TERMS


letin of the American Society for Information Science, 13- B
14. Bibliometrics: The study of regularities in citations,
Mancini, D. D. (1996). Mining your automated system for authorship, subjects, and other extractable facets from
systemwide decision making. Library Administration & scientific communication by using quantitative and visu-
Management, 10(1), 11-15. alization techniques. This study allows researchers to
understand patterns in the creation and documented use
Morris, A. (Ed.). (1992). Application of expert systems in of scholarly publishing.
library and information centers. London: Bowker-Saur.
Bibliomining: The application of statistical and pat-
Nicholson, S. (2003). The bibliomining process: Data tern-recognition tools to large amounts of data associ-
warehousing and data mining for library decision-making. ated with library systems in order to aid decision making
Information Technology and Libraries, 22(4), 146-151. or to justify services. The term bibliomining comes from
the combination of bibliometrics and data mining, which
Nutter, S. K. (1987). Online systems and the management are the two main toolsets used for analysis.
of collections: Use and implications. Advances in Library
Automation Networking, 1, 125-149. Data Warehousing: The gathering and cleaning of
data from disparate sources into a single database, which
Peters, T. (1996). Using transaction log analysis for library is optimized for exploration and reporting. The data ware-
management information. Library Administration & house holds a cleaned version of the data from operational
Management, 10(1), 20-25. systems, and data mining requires the type of cleaned
Sallis, P., Hill, L., Janee, G., Lovette, K., & Masi, C. (1999). data that live in a data warehouse.
A methodology for profiling users of large interactive Evidence-Based Librarianship: The use of the best
systems incorporating neural network data mining tech- available evidence, combined with the experiences of
niques. Proceedings of the 1999 Information Resources working librarians and the knowledge of the local user
Management Association International Conference (pp. base, to make the best decisions possible (Eldredge,
994-998). 2000).
Schulman, S. (1998). Data mining: Life after report genera- Integrated Library System: The automation system
tors. Information Today, 15(3), 52. for libraries that combines modules for cataloging, acqui-
Stanton, J. M. (2000). Reactions to employee performance sition, circulation, end-user searching, database access,
monitoring: Framework, review, and research directions. and other library functions through a common set of
Human Performance, 13, 85-113. interfaces and databases.

Surez-Balseiro, C. A., Iribarren-Maestro, I., Casado, E. S. Online Public Access Catalog (OPAC): The module
(2003). A study of the use of the Carlos III University of of the Integrated Library System designed for use by the
Madrid Librarys online database service in Scientific public to allow discovery of the librarys holdings through
Endeavor. Information Technology and Libraries, 22(4), the searching of bibliographic surrogates. As libraries
179-182. acquire more digital materials, they are linking those
materials to the OPAC entries.
Vizine-Goetz, D., Weibel, S., & Oskins, M. (1990). Auto-
mating descriptive cataloging. In R. Aluri, & D. Riggs
(Eds.), Expert systems in libraries (pp. 123-127). Norwood,
NJ: Ablex Publishing Corporation. NOTE
Wormell, I. (2003). Matching subject portals with the This work is based on Nicholson, S., & Stanton, J. (2003).
research environment. Information Technology and Li- Gaining strategic advantage through bibliomining: Data
braries, 22(4), 158-166. mining for management decisions in corporate, special,
Zucca, J. (2003). Traces in the clickstream: Early work on digital, and traditional libraries. In H. Nemati & C. Barko
a management information repository at the University of (Eds.). Organizational data mining: Leveraging enter-
Pennsylvania. Information Technology and Libraries, prise data resources for optimal performance (pp. 247
22(4), 175-178. 262). Hershey, PA: Idea Group.

105

TEAM LinG
106

Biomedical Data Mining Using RBF Neural


Networks
Feng Chu
Nanyang Technological University, Singapore

Lipo Wang
Nanyang Technological University, Singapore

INTRODUCTION Thus, to develop accurate and efficient classifiers


based on gene expression becomes a problem of both
Accurate diagnosis of cancers is of great importance for theoretical and practical importance. Recent approaches
doctors to choose a proper treatment. Furthermore, it also on this problem include artificial neural networks (Khan et
plays a key role in the searching for the pathology of al., 2001), support vector machines (Guyon, Weston,
cancers and drug discovery. Recently, this problem at- Barnhill, & Vapnik, 2002), k-nearest neighbor (Olshen &
tracts great attention in the context of microarray technol- Jain, 2002), nearest shrunken centroids (Tibshirani, Hastie,
ogy. Here, we apply radial basis function (RBF) neural Narashiman, & Chu, 2002), and so on.
networks to this pattern recognition problem. Our experi- A solution to this problem is to find out a group of
mental results in some well-known microarray data sets important genes that contribute most to differentiate
indicate that our method can obtain very high accuracy cancer subtypes. In the meantime, we should also provide
with a small number of genes. proper algorithms that are able to make correct prediction
based on the expression profiles of those genes. Such
work will benefit early diagnosis of cancers. In addition,
BACKGROUND it will help doctors choose proper treatment. Furthermore,
it also throws light on the relationship between the can-
Microarray is also called gene chip or DNA chip. It is a cers and those important genes.
newly appeared biotechnology that allows biomedical From the point of view of machine learning and statis-
researchers monitor thousands of genes simultaneously tical learning, cancer classification using gene expression
(Schena, Shalon, Davis, & Brown, 1995). Before the ap- profiles is a challenging problem. The reason lies in the
pearance of microarrays, a traditional molecular biology following two points. First, typical gene expression data
experiment usually works on only one gene or several sets usually contain very few samples (from several to
genes, which makes it difficult to have a whole picture several tens for each type of cancers). In other words, the
of an entire genome. With the help of microarrays, re- training data are scarce. Second, such data sets usually
searchers are able to monitor, analyze and compare ex- contain a large number of genes, for example, several
pression profiles of thousands of genes in one experiment. thousands. That is, the data are high dimensional. There-
On account of their features, microarrays have been fore, this is a special pattern recognition problem with
used in various tasks such as gene discovery, disease relatively small number of patterns and very high dimen-
diagnosis, and drug discovery. Since the end of the last sionality. To provide such a problem with a good solution,
century, cancer classification based on gene expression appropriate algorithms should be designed.
profiles has attracted great attention in both the biologi- In fact, a number of different approaches such as k-
cal and the engineering fields. Compared with traditional nearest neighbor (Olshen and Jain, 2002), support vector
cancer diagnostic methods based mainly on the morpho- machines (Guyon et al.,2002), artificial neural networks
logical appearances of tumors, the method using gene (Khan et al., 2001) and some statistical methods have
expression profiles is more objective, accurate, and reli- been applied to this problem since 1995. Among these
able. More importantly, some types of cancers have approaches, some obtained very good results. For ex-
subtypes with very similar appearances that are very hard ample, Khan et al. (2001) classified small round blue cell
to be classified by traditional methods. It has been proven tumors (SRBCTs) with 100% accuracy by using 96 genes.
that gene expression has a good capability to clarify this Tibshirani et al. (2002) successfully classified SRBCTs
previously muddy problem. with 100% accuracy by using only 43 genes. They also

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Biomedical Data Mining Using RBF Neural Networks

classified three different subtypes of lymphoma with the systemic bias induced during experiments. We follow
100% accuracy by using 48 genes. (Tibshirani, Hastie, the normalization procedure used by Dudoit, Fridlyand, B
Narashiman, & Chu, 2003) and Speed (2002). Three preprocessing steps were ap-
However, there are still a lot of things can be done to plied: (a) thresholding with floor of 100 and ceiling of
improve present algorithms. In this work, we use and 16000; (b) filtering, exclusion of genes with max/min<5 or
compare two gene selection schemes, i.e., principal com- (max-min)<500. max and min refer to the maximum and the
ponents analysis (PCA) (Simon, 1999) and a t-test-based minimum of the gene expression values, respectively; and
method (Tusher, Tibshirani, & Chu, 2001). After that, we (c) base 10 logarithmic transformation. There are 3571
introduce an RBF neural network (Fu & Wang, 2003) as the genes survived after these three steps. After that, the data
classification algorithm. were standardized across experiments, i.e., minus the
mean and divided by the standard deviation of each
experiment.
MAIN THRUST
Methods for Gene Selection
After a comparative study of gene selection methods, a
detailed description of the RBF neural network and some As mentioned in the former part, the gene expression data
experimental results are presented in this section. are very high-dimensional. The dimension of input pat-
terns is determined by the number of genes used. In a
Microarray Data Sets typical microarray experiment, usually several thousands
of genes take part in. Therefore, the dimension of patterns
We analyze three well-known gene expression data sets, is several thousands. However, only a small number of the
i.e., the SRBCT data set (Khan et al., 2001), the lymphoma genes contribute to correct classification; some others
data set (Alizadeh et al., 2000), and the leukemia data set even act as noise. Gene selection can eliminate the
(Golub et al., 1999). influence of such noise. Furthermore, the fewer the
The lymphoma data set (http://llmpp.nih.gov/lym- genes used, the lower the computational burden to the
phoma) (Alizadeh et al., 2000) contains 4026 well mea- classifier. Finally, once a smaller subset of genes is iden-
sured clones belonging to 62 samples. These samples tified as relevant to a particular cancer, it helps biomedical
belong to following types of lymphoid malignancies: researchers focus on these genes that contribute to the
diffuse large B-cell lymphoma (DLBCL, 42 samples), fol- development of the cancer. The process of gene selection
licular lymphoma (FL, nine samples) and chronicle lym- is ranking genes discriminative ability first and then
phocytic leukemia (CLL, 11 samples). In this data set, a retaining the genes with high ranks.
small part of data is missing. A k-nearest neighbor algo- As a critical step for classification, gene selection has
rithm was used to fill those missing values (Troyanskaya been studied intensively in recent years. There are two
et al., 2001). main approaches, one is principal component analysis
The SRBCT data set (http://research.nhgri.nih.gov/ (PCA) (Simon, 1999), perhaps the most widely used method;
microarray/Supplement/) (Khan et al., 2001) contains the the other is a t-test-based approach which has been more
expression data of 2308 genes. There are totally 63 training and more widely accepted. In the important papers
samples and 25 testing samples. Five of the testing samples (Alizadeh et al., 2000; Khan et al., 2001), PCA was used.
are not SRBCTs. The 63 training samples contain 23 Ewing The basic idea of PCA is to find the most informative
family of tumors (EWS), 20 rhabdomyosarcoma (RMS), 12 genes that contain most of the information in the data set.
neuroblastoma (NB), and eight Burkitt lymphomas (BL). Another approach is based on t-test that is able to mea-
And the 20 testing samples contain six EWS, five RMS, six sure the difference between two groups. Thomas, Olsen,
NB, and three BL. Tapscott, and Zhao. (2001) recommended this method.
The leukemia data set (http://www-genome.wi.mit.edu/ Tusher et al. (2001) and Pan (2002) also proposed their
cgi-\\bin /cancer/publications) (Golub et al., 1999) has method based on t-test, respectively. Besides these two
two types of leukemia, i.e., acute myeloid leukemia (AML) main methods, there are also some other methods. For
and acute lymphoblastic leukemia (ALL). Among these example, a method called Markov blanket was proposed
samples, 38 of them are for training; the other 34 blind by Xing, Jordan, and Karp (2001). Li, Weinberg, Darden,
samples are for testing. The entire leukemia data set and Pedersen (2001) applied another method which com-
contains the expression data of 7,129 genes. Different bined genetic algorithm and K-nearest neighbor.
with the cDNA microarray data, the leukemia data are PCA (Simon, 1999) aims at reducing the input dimen-
oligonucleotide microarray data. Because such expres- sion by transforming the input space into a new space
sion data are raw data, we need to normalize them to reduce described by principal components (PCs). All the PCs are

107

TEAM LinG
Biomedical Data Mining Using RBF Neural Networks

orthogonal and they are ordered according to the absolute total number of samples. x i is the general mean expres-
value of their eigenvalues. The k-th PC is the vector with sion value for gene i. s i is the pooled within-class
the k-th largest eigenvalue. By leaving out the vectors with standard deviation for gene i. Actually, the t-score used
small eigenvalues, the input spaces dimension is reduced. here is a t-statistics between a specific class and the
In fact, the PCs indicate the directions with largest overall centroid of all the classes.
variations of input vectors. Because PCA chooses vectors To compare the t-test-based method with PCA, we
with largest eigenvalues, it covers directions with largest also applied it to the lymphoma data set with the same
variations of vectors. In the directions determined by the procedure as what we did by using PCA. This method
vectors with small eigenvalues, the variations of vectors obtained 100% accuracy with only the top six genes. The
are very small. In a word, PCA intends to capture the most results are shown in Figure 1. This comparison indicated
informative directions (Simon, 1999). that the t-test-based method was much better than PCA
We tested PCA in the lymphoma data set (Alizadeh et in this problem.
al., 2000). We obtained 62 PCs from the 4026 genes in the
data set by using PCA. Then, we ranked those PCs accord- An RBF Neural Network
ing to their eigenvalues (absolute values). Finally, we used
our RBF neural network that will be introduced in the latter
An RBF neural network (Haykin, 1999) has three layers.
part to classify the lymphoma data set.
The first layer is an input layer; the second layer is a
At first, we randomly divided the 62 samples into two
hidden layer that includes some radial basis functions,
parts, 31 samples for training and the other 31 samples for
also known as hidden kernels; the third layer is an output
testing. We then input the 62 PCs one by one to the RBF
layer. An RBF neural network can be regarded as a
network according to their eigenvalue ranks starting with
mapping of the input domain X onto the output domain
the PC ranked one. That is, we first used only a single PC
Y. Mathematically, an RBF neural network can be de-
that is ranked 1 as the input to the RBF network. We trained
scribed as follows:
the network with the training data and subsequently tested
the network with the testing data. We repeated this pro-
N
cess with the top two PCs, then the top three PCs, and so y m ( x) = wmi G ( x t i ) + bm , i=1,2,,N; m=1,2,M
on. Figure 1 shows the testing error. From this result, we i =1

found that the RBF network can not reach 100% accuracy.
The best testing accuracy is 93.55% that happened when 36 Here stands for the Euclidean norm. M is the
or 61 PCs were input to the classifier. The classification result
number of outputs. N is the number of hidden kernels.
using the t-test-based gene selection method will be shown
in the next section, which is much better than PCA approach. y m (x) is the output m corresponding to the input x. t i is
The t-test-based gene selection measures the differ- the position of kernel i. wmi is the weight between the
ence of genes distribution using a t-test based scoring kernel i and the output m. bm is the bias on the output m.
scheme, i.e., t-score (TS). After that, only the genes with G ( x t i ) is the kernel function. Usually, an RBF neural
the highest TSs are to be put into our classifier. The TS of
network uses Gaussian kernel functions as follows:
gene i is defined as follows (Tusher et al., 2001):

x ik x i
TS i = max , k = 1,2,...K
d
k i s
Figure 1. Classification results of using PCA and the t-
test-based method as gene selection methods
x ik = jC k x ij / nk
n
x i = xij / n 1
j =1

(x )
where: 1 2 0. 8
si2 = ij x ik
nK k
jC k
0. 6
PCA
Er r or

d k = 1 / nk + 1 / n
t - t est
0. 4

0. 2
There are K classes. max {yk, k = 1,2,..K} is the maximum
of all y k, k = 1,2,..K. Ck refers to class k that includes nk 0
1 6 11 16 21 26 31 36 41 46 51 56 61
samples. xij is the expression value of gene i in sample j. x ik
Number of genes
is the mean expression value in class k for gene i. n is the

108

TEAM LinG
Biomedical Data Mining Using RBF Neural Networks

2 FUTURE TRENDS
B
x t i
G ( x t i ) = exp( 2
)
2 i
Until now, the focus of work is investigating the informa-
tion with statistical importance in microarray data sets. In
where i is the radius of the kernel i. the near future, we will try to incorporate more biological
The main steps to construct an RBF neural network knowledge into our algorithm, especially the correlations
include: (a) determining the positions of all the kernels (ti); of genes.
(b) determining the radius of each kernel ( i ); and (c) In addition, with more and more microarray data sets
calculating the weights between each kernel and each produced in laboratories around the world, we will try to
output node. mine multi-data-set with our RBF neural network, i.e., we
In this paper, we use a novel RBF neural network will try to process the combined data sets. Such an attempt
proposed by Fu and Wang (Fu and Wang, 2003), which will hopefully bring us a much broader and deeper insight
allows for large overlaps of hidden kernels belonging to into those data sets.
the same class.

Results CONCLUSION

In the SRBCT data set, we first ranked the entire 2308 Through our experiments, we conclude that the t-test-
genes according to their TSs (Tusher et al., 2001). Then based gene selection method is an appropriate feature
we picked out 96 genes with the highest TSs. We applied selection/dimension reduction approach, which can find
our RBF neural network to classify the SRBCT data set. more important genes than PCA can.
The SRBCT data set contains 63 samples for training and The results in the SRBCT data set and the leukemia
20 blind samples for testing. We input the selected 96 data set proved the effectiveness of our RBF neural
genes one by one to the RBF network according to their network. In the SRBCT data set, it obtained 100% accu-
TS ranks starting with the gene ranked one. We repeated racy with only seven genes. In the leukemia data set, it
this process with the top two genes, then the top three made only one error with 12, 20, 22, and 32 genes, respec-
genes, and so on. Figure 2 shows the testing errors with tively. In view of this, we also conclude that our RBF
respect to the number of genes. The testing error decreased neural network outperforms almost all the previously
to 0 when the top seven genes were input into the RBF published methods in terms of accuracy and the number
network. of genes required.
In the leukemia data set, we chose 56 genes with the
highest TSs (Tusher et al., 2001). We followed the same
procedure as in the SRBCT data set. We did classification REFERENCES
with 1 gene, then two genes, then three genes and so on.
Our RBF neural network got an accuracy of 97.06%, i.e. Alizadeh, A.A. et al. (2000). Distinct types of diffuse large
one error in all 34 samples, when 12, 20, 22, 32 genes were B-cell lymphoma identified by gene expression profiling.
input, respectively. Nature, 403, 503-511.

Figure 2. The testing result in the SRBCT data set Figure 3. The testing result in the leukemia data set

0. 5 0. 2

0. 4 0. 15
0. 3
Er r or
Er r or

0. 1
0. 2
0. 05
0. 1

0 0
1 6 11 16 21 26 31 36 41 46 51 56 61 66 71 76 81 86 91 96 1 4 7 10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55
Number of genes Number of genes

109

TEAM LinG
Biomedical Data Mining Using RBF Neural Networks

Dudoit, S., Fridlyand, J., & Speed, J. (2002). Comparison Tusher, V.G., Tibshirani, R., & Chu, G. (2001). Significance
of discrimination methods for the classification of tumors analysis of microarrays applied to the ionizing radiation
using gene expression data. Journal of American Statis- response. Proc. Natl. Acad. Sci. USA, 98, 5116-5121.
tics Association, 97, 77-87.
Xing, E.P., Jordan, M.I., & Karp, R.M. (2001). Feature
Fu, X., & Wang, L. (2003). Data dimensionality reduction selection for high-dimensional genomic microarray data.
with application to simplifying RBF neural network struc- Proceedings of the Eighteenth International Conference
ture and improving classification performance. IEEE Trans. on Machine Learning (pp. 601-608). Morgan Kaufmann
Syst., Man, Cybernetics. Part B: Cybernetics, 33, 399-409. Publishers, Inc.
Golub, T.R. et al. (1999). Molecular classification of can-
cer: class discovery and class prediction by gene expres- KEY TERMS
sion monitoring. Science, 286, 531-537.
Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Feature Extraction: Feature extraction is the process
Gene selection for cancer classification using support to obtain a group of features with the characters we need
vector machines. Machine Learning, 46, 389-422. from the original data set. It usually uses a transform (e.g.
principal component analysis) to obtain a group of fea-
Haykin, S. (1999). Neural network, a comprehensive tures at one time of computation.
foundation (2nd ed.). New Jersey, U.S.A: Prentice-Hall, Inc.
Feature Selection: Feature selection is the process to
Khan, J.M. et al. (2001). Classification and diagnostic select some features we need from all the original features.
prediction of cancers using gene expression profiling and It usually measures the character (e.g. t-test score) of each
artificial neural networks. Nature Medicine, 7, 673-679. feature first, then, chooses some features we need.
Li, L., Weinberg, C.R., Darden, T.A., & Pedersen, L.G. Gene Expression Profile: Through microarray chips,
(2001). Gene selection for sample classification based on an image that describes to what extent genes are ex-
gene expression data: Study of sensitivity to choice of pressed can be obtained. It usually uses red to indicate the
parameters of the GA/KNN method. Bioinformatics, 17, high expression level and uses green to indicate the low
1131-1142. expression level. This image is also called a gene expres-
sion profile.
Olshen, A.B., & Jain, A.N. (2002). Deriving quantitative
conclusions from microarray expression data. Microarray: A Microarray is also called a gene chip
Bioinformatics, 18, 961-970. or a DNA chip. It is a newly appeared biotechnology that
allows biomedical researchers monitor thousands of genes
Pan, W. (2002). A comparative review of statistical meth- simultaneously.
ods for discovering differentially expressed genes in repli-
cated microarray experiments. Bioinformatics, 18, 546-554. Principal Components Analysis: Principal compo-
nents analysis transforms one vector space into a new
Schena, M., Shalon, D., Davis, R.W., & Brown, P.O. (1995). space described by principal components (PCs). All the
Quantitative monitoring of gene expression patterns with PCs are orthogonal to each other and they are ordered
a complementary DNA microarray. Science, 270, 467-470. according to the absolute value of their eigenvalues. By
Thomas, J.G., Olsen, J.M., Tapscott, S.J. & Zhao, L.P. leaving out the vectors with small eigenvalues, the dimen-
(2001). An efficient and robust statistical modeling ap- sion of the original vector space is reduced.
proach to discover differentially expressed genes using ge- Radial Basis Function (RBF) Neural Network: An
nomic expression profiles. Genome Research, 11, 1227-1236. RBF neural network is a kind of artificial neural network.
Tibshirani, R., Hastie, T., Narashiman, B., & Chu, G. (2002). It usually has three layers, i.e., an input layer, a hidden
Diagnosis of multiple cancer types by shrunken centroids of layer, and an output layer. The hidden layer of an RBF
gene expression. Proc. Natl. Acad. Sci. USA, 99, 6567-6572. neural network contains some radial basis functions, such
as Gaussian functions or polynomial functions, to trans-
Tibshirani, R., Hastie, T., Narashiman, B., & Chu, G. (2003). form input vector space into a new non-linear space. An
Class predication by nearest shrunken centroids with applica- RBF neural network has the universal approximation abil-
tions to DNA microarrays. Statistical Science, 18, 104-117. ity, i.e., it can approximate any function to any accuracy,
Troyanskaya, O., Cantor, M, & Sherlock, G., et al. (2001). as long as there are enough hidden neurons.
Missing value estimation methods for DNA microarrays.
Bioinformatics, 17, 520-525.

110

TEAM LinG
Biomedical Data Mining Using RBF Neural Networks

T-Test: T-test is a kind of statistical method that Training a Neural Network: Training a neural net-
measures how large the difference is between two groups work means using some known data to build the structure B
of samples. and tune the parameters of this network. The goal of training
is to make the network represent a mapping or a regression we
Testing a Neural Network: To know whether a trained need.
neural network is the mapping or the regression we need,
we test this network with some data that have not been
used in the training process. This procedure is called
testing a neural network.

111

TEAM LinG
112

Building Empirical-Based Knowledge for


Design Recovery
Hee Beng Kuan Tan
Nanyang Technological University, Singapore

Yuan Zhao
Nanyang Technological University, Singapore

INTRODUCTION discusses the application of the proposed approach for the


recovery of functional dependencies enforced in database
Although the use of statistically probable properties is transactions. The final section shows our conclusion.
very common in the area of medicine, it is not so in
software engineering. The use of such properties may
open a new avenue for the automated recovery of de- MAIN THRUST
signs from source codes. In fact, the recovery of de-
signs can also be called program mining, which in turn The Approach
can be viewed as an extension of data mining to the
mining in program source codes. Many types of designs are usually implemented through
a few methods. The use of a method has a direct influ-
ence on the programs that implement the designs. As a
BACKGROUND result, these programs may have some certain charac-
teristics. And we may be able to recognize the designs or
Today, most of the tasks in software verification, test- their properties through recognizing these characteris-
ing, and re-engineering remain manually intensive tics, from either a theoretical or empirical basis or the
(Beizer, 1990), time-consuming, and error prone. As combination of the two.
many of these tasks require the recognition of designs An overview of the approach for building empirical-
from program source codes, automation of the recogni- based knowledge for design recovery through program
tion is an important means to improve these tasks. analysis is shown in Figure 1. In the figure, arcs show
However, many designs are difficult (if not impossible) interactions between tasks. In the approach, we first
to recognize automatically from program source codes research the designs or their properties, which can be
through theoretical knowledge alone (Biggerstaff, recognized from some characteristics in the programs
Mitbander, & Webster, 1994; Kozaczynski, Ning, & that implement them through automated program analy-
Engberts, 1992). Most of the approaches proposed for sis. This task requires domain knowledge or experience.
the recognition of designs from program source codes The reason for using design properties is that some
are based on plausible inference (Biggerstaff et al., designs could be too complex to recognize directly. In
1994; Ferrante, Ottenstein, & Warren, 1987; the latter case, we first recognize the properties, then
Kozaczynski et al., 1992). That is, they are actually use them to infer the designs. We aim for characteris-
based on empirical-based knowledge (Kitchenham, tics that are not only sufficient but also necessary for
Pfleeger, Pickard, Jones, Hoaglin, Emam, & Rosenberg, recognizing designs or their properties. If the charac-
2002). However, to the best of our knowledge, the teristics are sufficient but not necessary, even if we
building of empirical-based knowledge to supplement cannot infer the target designs, they do not imply the
theoretical knowledge for the recognition of designs nonexistence of the designs. If we cannot find charac-
from program source code has not been formally dis- teristics from which a design can be formally proved,
cussed in the literature. then we will look for characteristics that have signifi-
This paper introduces an approach for the building cant statistical evidence. These empirical-based char-
and applying of empirical-based knowledge to supple- acteristics are taken as hypotheses. With the use of
ment theoretical knowledge for the recognition of de- hypotheses and theoretical knowledge, a theory is built
signs from program source codes. The first section for the inference of designs.
introduces the proposed approach. The second section

Copyright 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.

TEAM LinG
Building Empirical-Based Knowledge for Design Recovery

Secondly, we design experiments to validate the hy- Figure 1. An overview of the proposed design recovery
potheses. Some software tools should be developed to B
automate or semiautomate the characterization of the Identify Designs
or
properties stated in the hypotheses. We may merge their Properties
multiple hypotheses together as a single hypothesis for
the convenience of hypothesis testing. An experiment is
Identity Program
designed to conduct a binomial test (Gravetter & Wallnau, Charactertics and Formulate
Hypotheses
2000) for each resulting hypothesis. If altogether we
have k hypotheses denoted by H1,., Hk and we would
like the probability of validity of the proposed design Develop Software Tools
Develop Theory
to aid Experiments for
recovery to be more than or equal to q, then we must Hypotheses Testing
for the Inference of Designs

choose p1,., pk such that p1,., pk q. For each


hypothesis Hj (1 j k), the null and alternate hypoth-
esis of the binomial test states that less than pj*100% Conduct Experiements for
Hypotheses Testing
and equal or more than pj*100%, respectively, of the
cases that Hj holds. That is:
accept hypothesis Develop Algorithms
Hj0 (null hypothesis): probability that Hj holds < p for Automated Design
Recovery
Hj1 (alternate hypothesis): probability that Hj holds reject hypothesis

p
Develop Software Tool
to Implement the Algorithms
For the use of normal approximation for the bino-
mial test, both npj and n(1-pj) must be greater than or
equal to 10. As such, the sample size n must be greater Conduct Experiment
than or equal to max (10/pj, 10/(1-pj)). The experiment to validate the Effectiveness

is designed to draw a random sample of size n to test the


hypothesis. For each case in the sample, the validity of
the hypothesis is examined. The total number of cases,
X, that the hypothesis holds is recorded and substituted
in the following binomial test statistics: Applying the Proposed Approach for
the Recovery of Functional
X /n p Dependencies
j
z= ( p (1 p ) / n )
j j Let R be a record type and X be a sequence of attributes
of R. For any record r in R, its sequence of values of the
Let be the Type I error probability. If z is greater attributes in X is referred as the X-value of r. Let R be a
than z, we reject the null hypothesis; otherwise, we record type, and X and Y be sequences of attributes of R.
accept the null hypothesis, where the probability of We say that the functional dependency (FD), X Y of R,
standard normal model for z z is . holds at time t, if at time t, for any two R records r and s,
Thirdly, we develop the required software tools and the X-values of r and s are identical, then the Y-values of
conduct the experiments to test each hypothesis ac- r and s are also identical. We say that the functional
cording to the design drawn in the previous step. dependency holds in a database if, except in the midst of
Fourthly, if all the hypotheses are accepted, algo- a transaction execution (which updates some record types
rithms will be developed for the use of the theory to involved in the dependency), the dependency always
automatically recognize the designs from program source holds (Ullman, 1982).
codes. A software tool will also be developed to imple- Many of the worlds database applications have been
ment the algorithms. Some experiments should also be built on old generation DBMSs (database management
conducted to validate the effectiveness of the method. systems). Due to the nature of system development, many

113

TEAM LinG
Building Empirical-Based Knowledge for Design Recovery

functional dependencies are not discovered in the initial := y), {xclu_fd(X Y of R)}}), such that if the rd node
system development. They are only identified during the successfully reads an R record specified, and y is iden-
system maintenance stage. tical to the Y-value of the record read, is called an
Although keys can be used to implement functional insertion pattern for enforcing the FD, X Y of R.
dependencies in old generation DBMSs, due to the effort A program path in the form ({{xclu_fd(X Y of R)},
in restructuring databases during the system maintenance mdf(R, X == x0, Y := y), {xclu_fd(X Y of R)}}), such that
stage, most functional dependencies identified during this all the R records in which the X-values are equal to x 0 are
stage are not defined explicitly as keys in the databases. modified by the mdf node, is called a Y-modification
They are enforced in transactions. Furthermore, most of pattern for enforcing the FD, X Y of R.
the conventional files and relational databases allow only A program path in the form (rd(R, X == x), {xclu_fd(X
the definition of one key. As such, most of the candidate Y of R)}, {{xclu_fd(X Y of R)}, mdf(R, X == x0, X :=
keys are enforced in transactions. As a result, many func- x, Y := unchange), {xclu_fd(X Y of R)}}), such that the
tional dependencies in a legacy database are enforced in mdf node is only executed if the rd node does not
the transactions that update the database. Our previous successfully read an R record specified, is called an X-
work (Tan, 2004) has proven that if all the transactions that modification pattern for enforcing the FD, X Y of
update a database satisfy a set of properties with reference R.
to a functional dependency, X Y of R, then the func- We have proven the following rules by Tan (2004):
tional dependency holds in the database. Before proceed-
ing further, we shall discuss these properties. Nonviolation of FD: In a transaction, if all the
For generality, we shall express a program path in the nodes that insert R records or modify the at-
form (a1, , an), where ai is either a node or in the form tribute X or Y in any program path from the start
node to the end node are always contained in a
{ nii ,.., nik }, in which nii ,.., nik are nodes. If ai is in the
sequence of subpaths in the patterns for enforc-
ing the functional dependency, X Y of R, then
form { nii ,.., nik }, then in (a1, , a n), after the predeces-
the transaction does not violate the functional
sor node of a i, the subsequence of nodes, nii ,.., nik , dependency.
FD in Database: Each transaction that updates
may occur any number of times (possibly zero), before any record type involved in a functional depen-
proceeding to the successor of a i. dency does not violate the functional dependency
Before proceeding further, we shall introduce some if and only if is a functional dependency
notations to represent certain types of nodes in control designed in the database.
flow graphs of transactions that will be used throughout
this paper: Theoretically, the property stated in the first rule is
not a necessary property in order for the functional
rd(R, W == b): A node that reads or selects an R dependency to hold. As such, we may not be able to
record in which the W-value is b if it exists. recover all functional dependencies enforced by rec-
mdf(R, W == b, Z1 := c1,.., Z n := cn): A node that ognizing these properties.
modifies the Z 1-value, .. , Zn-value of an R record, Fortunately, other than very exceptional cases, most
in which the original W-value is b, to c1,.., cn, of the enforcement of functional dependency does
respectively. result to the previously mentioned property. As such,
ins(R, Z1 := c1,.., Zn := cn): A node that inserts an empirically, the property is usually also necessary for
R record, in which its Z1-value,.. , Zn-value are the functional dependency, X Y of R, to hold in the
set to c1,.., c n, respectively. database. Thus, we take the following hypothesis.

Here, R is a record type. W, Z 1, , Zn are sequences of Hypothesis 1: If a transaction does not violate
R attributes. The values of those attributes that are not the functional dependency, X Y of R, then all
mentioned in the mdf and ins nodes can be modified and the nodes that insert R records or modify the
set to any value. attribute X or Y in any program path from the start
We shall also use xclu_fd(X Y of R) to denote a node to the end node are always contained in a
node in the control flow graph of a transaction that does sequence of subpaths in the patterns for enforc-
not perform any updating that may lead to the violation of ing the functional dependency.
the functional dependency X Y of R.
A program path in the form (rd(R, X == x), {xclu_fd(X With the hypothesis, the result discussed by Tan and
Y of R)}, {{xclu_fd(X Y of R)}, ins(R, X := x, Y Thein (in press) can be extended to the following theorem.

114

TEAM LinG
Building Empirical-Based Knowledge for Design Recovery

Theorem 1: A functional dependency, X Y of R, manufacturer to the value of new-equipment-manu-


holds in a database if and only if in each transaction facturer. B
that updates any record type involved in the func-
tional dependency, all the nodes that insert R records We also analyzed 50 transactions from three exist-
or modify the attribute X or Y in any program path ing database applications. Table 1 summarizes the result
from the start node to the end node are always of the transactions designed by the students.
contained in a sequence of subpaths in the patterns We examined each transaction that enforces the
for enforcing the functional dependency. functional dependency. We found that in each of these
transactions, all the nodes that insert R records or
The proof of sufficiency can be found in previous modify the attribute X or Y in any program path from the
work (see Tan & Thein, 2004). The necessity is implied start node to the end node are always contained in a
directly from the hypothesis. sequence of subpaths in the patterns for enforcing the
This theorem can be used to recover all the functional functional dependency. As such, Hypothesis 1 holds in
dependencies in a database. We have developed an each of these transactions. Because each transaction
algorithm the recovery. The algorithm is similar to the that does not enforce the functional dependency cor-
algorithm presented by Tan and Thein (2004). rectly violates the functional dependency, Hypothesis 1
We have conducted an experiment to validate the always holds in such a transaction. Therefore, Hypoth-
hypothesis. In this experiment, we get each of our 71 esis 1 holds in all the 213 transactions. For the 50
postgraduate students who take the software require- transactions that we drew from three existing database
ments analysis and design course to design the follow- applications, each database application enforces only
ing three transactions in pseudocode to update the table, one functional dependency in their transactions (other
plant-equipment (plant-name, equipment-name, equip- functional dependencies are implemented as keys in the
ment-manufacturer, manufacturer-address), such that databases). We checked each transaction on Hypothesis
each transaction ensures that the functional dependency, 1 with respect to the functional dependency that is
equipment-manufacturer manufacturer-address of enforced in the transactions in the database application
plant-equipment holds: to which the transaction belongs. And Hypothesis 1 held
in all the transactions. Therefore, Hypothesis 1 holds
Insert a new tuple in the table to store a user input for all the 263 transactions in our sample.
each time. The input has four fields, inp-plant- Taking 0.001 () as the Type I error probability, if
name, inp-equipment-name, inp-equipment-manu- the binomial test statistics z computed from the formula
facturer and inp-manufacturer-address. In the tuple, discussed previously in this paper is greater than 3.09,
the value of each attribute is set according to the we reject the null hypothesis; otherwise, we accept the
corresponding input field. null hypothesis. In our experiment, we have n = 263, X
For each user input that has a single field, inp- = 263, and p = 0.96, so our z score is 3.31. Note that our
equipment-manufacturer, delete the tuple in the sample size is greater than max(10/0.96, 10/(1-0.96)).
table in which the value of the attribute, equip- Thus, we reject the null hypothesis, and the test gives
ment-manufacturer, is identical to the value of evidence that Hypothesis 1 holds for equal or more than
inp-equipment-manufacturer. 96% of the cases at the 0.1 % level of significance.
For each user input that has two fields, old-equip- Because only one hypothesis is involved in the approach
ment-manufacturer and new-equipment-manufac- for the inference of functional dependencies, the test
turer, update the equipment-manufacturer in each gives evidence that the approach holds for equal or more
tuple in which the value of equipment-manufac- than 96% of the cases at the 0.1 % level of significance.
turer is identical to the value of old-equipment-

FUTURE TRENDS
Table 1. The statistics of an experiment
In general, many designs are very difficult (if not impos-
sible) to formally prove from program characteristics
Enforcement of the (Chandra, Godefroid, & Palm, 2002; Clarke, Grumberg,
Transaction Functional Dependency & Peled,1999; Deng & Kothari, 2002). Therefore, the
Correct Wrong use of empirical-based knowledge is important for the
1 55 16 automated recognition of designs from program source
2 65 6 codes (Embury & Shao, 2001; Tan, Ling, & Goh, 2002;
3 36 35 Wong, 2001). We believe that the integration of empiri-

115

TEAM LinG
Building Empirical-Based Knowledge for Design Recovery

cal-based properties into existing program analysis and Gravetter, F. J., & Wallnau, L. B. (2000). Statistics for the
model-checking techniques will be a fruitful direction behavioral sciences. Belmont, CA: Wadsworth.
in the future.
Kitchenham, B. A., Pfleeger, S. L., Pickard, L. M., Jones,
P. W., Hoaglin, D. C., Emam, K. E., & Rosenberg, J. (2002).
Preliminary guidelines for empirical research in software
CONCLUSION engineering. IEEE Transactions on Software Engineer-
ing, 28(8), 721-734.
Empirical-based knowledge has been used in the recog-
nition of designs from source codes through automated Kozaczynski, W., Ning, J., & Engberts, A. (1992). Program
program analysis. It is a promising research direction. concept recognition and transformation. IEEE Transac-
This chapter introduces the approach for building em- tions on Software Engineering, 18(12), 1065-1075.
pirical-based knowledge, a vital part for such research
exploration. We have also applied it in the recognition Tan, H. B. K., Ling, T. W., & Goh, C. H. (2002). Exploring
of functional dependencies enforced in database trans- into programs for the recovery of data dependencies
actions from the transactions. We believe that our ap- designed. IEEE Transactions on Knowledge and Data
proach will encourage more exploration of the discov- Engineering, 14(4), 825-835.
ery and use of empirical-based knowledge in this area. Tan, H. B. K., & Thein, N. L. (in press). Recovery of PTUIE
Recently, we have completed our work on the recovery handling from source codes through recognizing its prob-
of posttransaction user-input error (PTUIE) handling able properties. IEEE Transactions on Knowledge and
for database transaction. This approach appeared in IEEE Data Engineering, 16(10), 1217-1231.
Transactions on Knowledge and Data Engineering
(Tan & Thein, 2004). Ullman, J. D. (1982). Principles of database systems (2nd
REFERENCES

Basili, V. R. (1996). The role of experimentation in software engineering: Past, current, and future. The 18th International Conference on Software Engineering (pp. 442-449), Germany.

Beizer, B. (1990). Software testing techniques. New York: Van Nostrand Reinhold.

Chandra, S., Godefroid, P., & Palm, C. (2002). Software model checking in practice: An industrial case study. Proceedings of the International Conference on Software Engineering (pp. 431-441), USA.

Clarke, E. M., Grumberg, O., & Peled, D. A. (1999). Model checking. MIT Press.

Deng, Y. B., & Kothari, S. (2002). Recovering conceptual roles of data in a program. Proceedings of the International Conference on Software Maintenance (pp. 342-350), Canada.

Embury, S. M., & Shao, J. (2001). Assisting the comprehension of legacy transactions. Proceedings of the Working Conference on Reverse Engineering (pp. 345-354), Germany.

Ferrante, J., Ottenstein, K. J., & Warren, J. O. (1987). The program dependence graph and its use in optimisation. ACM Transactions on Programming Languages and Systems, 9(3), 319-349.

Gravetter, F. J., & Wallnau, L. B. (2000). Statistics for the behavioral sciences. Belmont, CA: Wadsworth.

Kitchenham, B. A., Pfleeger, S. L., Pickard, L. M., Jones, P. W., Hoaglin, D. C., Emam, K. E., & Rosenberg, J. (2002). Preliminary guidelines for empirical research in software engineering. IEEE Transactions on Software Engineering, 28(8), 721-734.

Kozaczynski, W., Ning, J., & Engberts, A. (1992). Program concept recognition and transformation. IEEE Transactions on Software Engineering, 18(12), 1065-1075.

Tan, H. B. K., Ling, T. W., & Goh, C. H. (2002). Exploring into programs for the recovery of data dependencies designed. IEEE Transactions on Knowledge and Data Engineering, 14(4), 825-835.

Tan, H. B. K., & Thein, N. L. (2004). Recovery of PTUIE handling from source codes through recognizing its probable properties. IEEE Transactions on Knowledge and Data Engineering, 16(10), 1217-1231.

Ullman, J. D. (1982). Principles of database systems (2nd ed.). Rockville, MD: Computer Science Press.

Wong, K. (2001). Research challenges in reverse engineering community. Proceedings of the International Workshop on Program Comprehension (pp. 323-332), Canada.


KEY TERMS

Control Flow Graph: An abstract data structure used in compilers. It is an abstract representation of a procedure or program, maintained internally by a compiler. Each node in the graph represents a basic block. Directed edges are used to represent jumps in the control flow.

Design Recovery: Recreates design abstractions from a combination of code, existing design documentation (if available), personal experience, and general knowledge about problem and application domains.

Functional Dependency: For any record r in a record type, its sequence of values of the attributes in X is referred to as the X-value of r. Let R be a record type, and X and Y be sequences of attributes of R. We say that the functional dependency X → Y of R holds at time t if, at time t, for any two R records r and s whose X-values are identical, the Y-values of r and s are also identical. (A small executable rendering of this definition appears after these key terms.)

Hypothesis Testing: Hypothesis testing refers to the process of using statistical analysis to determine if the observed differences between two or more samples are due to random chance (as stated in the null hypothesis) or to true differences in the samples (as stated in the alternate hypothesis).

Model Checking: A method for formally verifying finite-state concurrent systems. Specifications about the system are expressed as temporal logic formulas, and efficient symbolic algorithms are used to traverse the model defined by the system and check whether the specification holds.

Program Analysis: Offers static compile-time techniques for predicting safe and computable approximations to the set of values or behaviors arising dynamically at run-time when executing a program on a computer.

PTUIE: Posttransaction user-input error. An error made by users in an input to a transaction execution and discovered only after completion of the execution.

Transaction: An atomic set of processing steps in a database application such that all the steps are performed either fully or not at all.
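As promised under the Functional Dependency entry, the sketch below is a direct, minimal rendering of that definition in code. The record representation (one dict per record) and the function name are illustrative choices of ours.

    def fd_holds(records, X, Y):
        """Check whether the functional dependency X -> Y holds.

        records: list of dicts (one per record); X, Y: sequences of
        attribute names. Returns True if any two records with identical
        X-values also have identical Y-values.
        """
        seen = {}
        for r in records:
            x_value = tuple(r[a] for a in X)
            y_value = tuple(r[a] for a in Y)
            # The first occurrence fixes the expected Y-value for this X-value.
            if seen.setdefault(x_value, y_value) != y_value:
                return False
        return True

    rows = [{"id": 1, "maker": "Acme", "country": "US"},
            {"id": 2, "maker": "Acme", "country": "US"}]
    assert fd_holds(rows, X=("maker",), Y=("country",))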

Business Processes

David Sundaram
The University of Auckland, New Zealand

Victor Portougal
The University of Auckland, New Zealand
INTRODUCTION

The concept of a business process is central to many areas of business systems, specifically to business systems based on modern information technology. In the new era of computer-based business management, the design of business processes has eclipsed the previous functional design. Lindsay, Downs, and Lunn (2003) suggest that business processes may be divided into material, information, and business parts. Further search for efficiency and cost reduction will be predominantly through office automation. The information part, including data warehousing, data mining, and the increasing informatisation of office processes, will play a key role. Data warehousing and data mining, in and of themselves, are business processes that are aimed at increasing the intelligence density (Dhar & Stein, 1997) of the data. But more important are the significant roles they play within the context of larger business and decision processes. Apart from these, the creation and maintenance of a data warehouse itself comprises a set of business processes (see Scheer, 2000). Hence a proper understanding of business processes is essential to better understand data warehousing as well as data mining.


BACKGROUND

Companies that make products or provide services have several functional areas of operations. Each functional area comprises a variety of business functions or business activities. For example, the functional area of financial accounting includes the business functions of financials, controlling, asset management, and so on. The human resources functional area includes the business functions of payroll, benefit administration, workforce planning, and application data administration. Historically, organizational structures have separated functional areas, and business education has been similarly organized, so each functional area was taught as a separate course. In materials requirement planning (MRP) systems, predecessors of enterprise resource planning (ERP) systems, all functional areas were presented as subsystems supported by a separate functional area's information system. However, in a real business environment, functional areas are interdependent, each requiring data from the others. This fostered the development of the concept of a business process as a multi-functional set of activities designed to produce a specified output.

This concept has a customer focus. Suppose a defective product was delivered to a customer; it is the business function of customer service to accept the defective item. The actual repair and redelivery of the item, however, is a business process that involves several functional areas and functions within those areas. The customer is not concerned about how the product was made, or how its components were purchased, or how it was repaired, or the route the delivery truck took to get to her house. The customer wants the satisfaction of having a working product at a reasonable price. Thus, the customer is looking across the company's functional areas in her process. Business managers are now trying to view their business operations from the perspective of a satisfied customer. Thinking in terms of business processes helps managers to look at their organization from the customer's perspective. ERP programs help to manage company-wide business processes, using a common database and shared management reporting tools. ERP software supports the efficient operation of business processes by integrating business activities, including sales, marketing, manufacturing, accounting, and staffing.


MAIN THRUST

In the following sections we first look at some definitions of business processes and follow this with some representative classifications of business processes. After laying this foundation, we look at business processes that are specific to data warehousing and data mining. The section culminates by looking at a business process modelling methodology (namely ARIS) and briefly discussing the dark side of business processes (namely mal-processes).
Definitions of Business Processes

Many definitions have been put forward to define business processes: some broad and some narrow. The broad ones help us understand the range and scope of business processes, but the narrow ones are also valuable in that they are actionable, pragmatic definitions that help us in defining, modelling, and re-engineering business processes.

Ould (1995) lists a few key features of business processes: a process contains purposeful activity, it is carried out collaboratively by a group, it often crosses functional boundaries, and it is invariably driven by outside agents or customers. Jacobson (1995), on the other hand, succinctly describes a business process as "the set of internal activities performed to serve a customer." Bider (2000) suggests that the business process re-engineering (BPR) community feels there is no great mystery about what a process is; they follow the most general definition of business processes proposed by Hammer and Champy (1993), that a process is a set of partially ordered activities intended to reach a goal. Davenport (1993) defines process broadly as "a structured, measured set of activities designed to produce a specified output for a particular customer or market" and more specifically as "a specific order of work activities across time and place, with a beginning, an end, and clearly identified inputs and outputs: a structure for action."

While these definitions are useful, they are not adequate. Sharp and McDermott (2001) provide an excellent working definition of a business process:

A business process is a collection of interrelated work tasks, initiated in response to an event that achieves a specific result for the customer of the process.

It is worth exploring each of the phrases within this definition.

"achieves a particular result"

• The result might be goods and/or services.
• It should be possible to identify and count the result, e.g., fulfilment of orders, resolution of complaints, raising of purchase orders, etc.

"for the customer of the process"

• Every process has a customer. The customer may be internal (an employee) or external (an organisation).
• A key requirement is that the customer should be able to give feedback on the process.

"initiated in response to a specific event"

• Every process is initiated by an event.
• The event is a request for the result produced by the process.

"work tasks"

• The business process is a collection of clearly identifiable tasks executed by one or more actors (a person, organisation, machine, or department).
• A task could potentially be divided up into more and finer steps.

"a collection of interrelated"

• Such steps and tasks are not necessarily sequential but could have parallel flows connected with complex logic.
• The steps are interconnected through their dealing with or processing one (or more) common work item(s) or business object(s).

Due to the importance of the business process concept to the development of computerized enterprise management, work on refining the definition is ongoing. Lindsay, Downs, and Lunn (2003) argue that the definitions of business process given in much of the literature on Business Process Management (BPM) are limited in depth and their related models of business processes are too constrained. Because they are too limited to express the true nature of business processes, they need to be further developed and adapted to today's challenging environment.

Business Process Classification

Over the years many classifications of processes have been suggested. The American Productivity & Quality Center (Process Classification Framework, 1996) distinguishes two types of processes: 1) operating processes and 2) management and support processes. Operating processes include processes such as:

• Understanding Markets and Customers
• Development of Vision and Strategy
• Design of Products and Services
• Marketing and Selling of Products and Services
• Production and Delivery of Products and Services
• Invoicing and Servicing of Customers
In contrast, management and support processes include processes such as:

• Development and Management of Human Resources
• Management of Information
• Management of Financial and Physical Resources
• Management of External Relationships
• Management of Improvement and Change
• Execution of Environmental Management Program

Most organisations would be able to fit their processes within this broad classification. However, the Gartner group (Genovese, Bond, Zrimsek, & Frey, 2001) has come up with a slightly different set of processes that has a lifecycle orientation:

• Prospect to Cash and Care
• Requisition to Payment
• Planning and Execution
• Plan to Performance
• Design to Retirement
• Human Capital Management

Each and every one of these processes could potentially be supported by data warehousing and data mining processes and technologies.

Data Warehousing Business Processes

Information movement and use come under the category of a business process when they involve human-computer interaction. Data warehousing includes a set of business processes, classified above as management of information. The set includes:

• Data input
• Data export
• Special query and reporting (regular reporting is a fully computerized process)
• Security and safety procedures

The bulk of the processes are of course in the data input category. It is here that many errors originate that can disrupt the smooth operations of the management system. Strict formalisation of the activities and their sequence brings the major benefits. However, the other categories are important as well, and their formal description and enforcement help to avoid mal-functioning of the system.

Data Mining Business Processes

Data mining is a key business process that enables organisations to identify valid, novel, useful, and understandable patterns in data. Some of the key steps of the data mining business process involve:

• Understanding the business
• Preparation of the data; this would usually involve data selection, pre-processing, and potentially transformation of the data into a form that is more amenable to modelling and understanding
• Modelling using data mining techniques
• Interpretation and evaluation of the model and results, which will hopefully help us to better understand the business
• Deploying the solution

Obviously the above steps are iterative at all stages of the process and could feed backward. Tools such as Clementine from SPSS support all the above steps.
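The composition of these steps can be made concrete with a small driver function. This is only a sketch under our own assumptions: the step functions are placeholders to be supplied by the caller, and the feedback loop simply restarts from data preparation when evaluation fails, mirroring the iterative, feed-backward nature noted above.

    def mine(raw_rows, prepare, build_model, evaluate, deploy, max_rounds=3):
        """Compose the data mining steps listed above."""
        for _ in range(max_rounds):
            data = prepare(raw_rows)      # selection, pre-processing, transformation
            model = build_model(data)     # any data mining technique
            if evaluate(model):           # interpretation against business goals
                return deploy(model)      # deployment of the solution
        return None                       # give up after max_rounds iterations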
Business Processes Modelling

Business processes frequently have to be recorded. According to Scheer (2000), the components of a process description are events, statuses, users, organizational units, and information technology resources. This multitude of elements would severely complicate the business process model. In order to reduce this complexity, many researchers suggest that the model be divided into individual views that can be handled (largely) independently from one another. This is done, for example, in ARIS (see Scheer, 2000). The relationships between the components within a view are tightly coupled, while the individual views are rather loosely linked.

ARIS includes four basic views, represented by:

• Data view: statuses and events represented as data, and the other data structures of the process
• Function view: the description of functions used in business processes
• Organization view: the users and the organizational units, as well as their relationships and the relevant structures
• Control view: presenting the relationships between the other three views

The integration of the first three views through a separate view is the main difference between ARIS and other descriptive methodologies, like Object Oriented Design (OOD) (Jacobson, 1995). Breaking down the initial business process into individual views reduces the complexity of its description. Figure 1 illustrates these four views.
Figure 1. ARIS views/house of business engineering (Scheer, 2000). [The figure shows the Organisation view at the top, linked to the Data, Control, and Function views below it.]

The control view can be depicted using event-driven process chains (EPC). The four key components of the EPC are events (statuses), functions (transformations), organization, and information (data). Events (initial) trigger functions that then result in events (final). The functions are carried out by one or more organizational units using appropriate data. A simplified depiction of the interconnection between the four basic components of the EPC is illustrated in Figure 2. This was modeled using the ARIS software (IDS Scheer, 2000).

One or more events can be connected to two or more functions (and vice versa) using the three logical operators OR, AND, and XOR (exclusive OR). These basic elements and operators enable us to model even the most complex process.
1998) is also based on the four ARIS views, it employs ferred to another destination).
several other views, which are not so essential, but When executing a process, a user may intentionally
nevertheless helpful in depicting, storing and retrieving put wrong data into the database, or modify the data in the
business processes. wrong way: e.g. execute an incomplete business process,
For example, information technology resources can or execute a wrong branch of the business process, or
constitute a separate descriptive object, the resource even create an undesirable branch of a business process.
Such processes are called mal-processes.
Hence mal-processes are behaviour to be avoided
Figure 2. Basic components of the EPC
by the system. It is a sequence of actions that a system
can perform, interacting with a legal user of the system,
Initial Event resulting in harm for the organization or stakeholder if
the sequence is allowed to complete. The focus is on the
actions, which may be done intentionally or through
neglect. Similar effects produced accidentally are a
Transforming Organizational
Data
Function unit
subject of interest for data safety. Just as we have best
business practices, mal-processes illustrate the wrong
business practices. So, it would be natural, by analogy
with the reference model (that represents a repository
Final Event of business models reflecting best business practices),
to create a repository of business models reflecting

wrong business practices. Thus a designer could avoid wrong design solutions.

A repository of such mal-processes would enable organisations to avoid typical mistakes in enterprise system design. Such a repository can be a valuable asset in the education of ERP designers, and it can be useful in general management education as well. Another application of this repository might be troubleshooting. For example, sources of some nasty errors in the sales and distribution system might be found in the mal-processes of data entry in finished goods handling.


FUTURE TRENDS

There are three key trends that characterise business processes: digitisation (automation), integration (intra- and inter-organisational), and lifecycle management (Kalakota & Robinson, 2003). Digitisation involves the attempts by many organisations to completely automate as many of their processes as possible. Another equally important initiative is the seamless integration and coordination of processes within and without the organisation: backward to the suppliers, forward to the customers, and vertically across operational, tactical, and strategic business processes. The management of both these initiatives/trends depends to a large extent on the proper management of processes throughout their lifecycle: from process identification, process modelling, process analysis, process improvement, and process implementation to process execution and process monitoring/controlling (Rosemann, 2001). Implementing such a lifecycle orientation enables organizations to move in benign cycles of improvement and to sense, respond, and adapt to the changing environment (internal and external). All these trends require not only the use of Enterprise Systems as a foundation but also data warehousing and data mining solutions. These trends will continue to be major drivers of the enterprise of the future.


CONCLUSION

In this modern landscape, business processes and the techniques and tools for data warehousing and data mining are intricately linked together. The impacts are not just one way. Concepts from business processes could be and are used to make the data mining and data warehousing processes an integral part of organizational processes. Data warehousing and data mining processes are a regular part of organizational business processes, enabling the conversion of operational information into tactical and strategic level information. An example of this is the Cross Industry Standard Process (CRISP) for Data Mining (CRISP-DM, 2004), which provides a lifecycle, process-oriented approach to the mining of data in organizations. Apart from this, data warehousing and data mining techniques are a key element in various steps of the process lifecycle alluded to in the trends. Data warehousing and data mining techniques help us not only in process identification and analysis (identifying and analyzing candidates for improvement) but also in the execution, monitoring, and control of processes. Data warehousing technologies enable the collecting, aggregating, slicing, and dicing of process information, while data mining technologies enable the search for patterns in process information, allowing us to monitor and control organizational processes more efficiently. Thus there is a symbiotic, mutually enhancing relationship, both at the conceptual level and at the technology level, between business processes and data warehousing and data mining.


REFERENCES

APQC (1996). Process classification framework (pp. 1-6). APQC's International Benchmark Clearinghouse & Arthur Andersen & Co.

Bider, I., Johannesson, P., & Perjons, E. (2002). Goal-oriented patterns for business processes. Workshop on Goal-Oriented Business Process Modelling, London.

CRISP-DM (2004). Retrieved from http://www.crisp-dm.org

Curran, T., Keller, G., & Ladd, A. (1998). SAP R/3 business blueprint: Understanding the business process reference model. Upper Saddle River, NJ: Prentice Hall.

Davenport, T. H. (1993). Process innovation. Boston, MA: Harvard Business School Press.

Davis, R. (2001). Business process modelling with ARIS: A practical guide. UK: Springer-Verlag.

Genovese, Y., Bond, B., Zrimsek, B., & Frey, N. (2001). The transition to ERP II: Meeting the challenges. Gartner Group.

Hammer, M., & Champy, J. (1993). Re-engineering the corporation: A manifesto for business revolution. New York: Harper Business.

Jacobson, I. (1995). The object advantage. Addison-Wesley.

Kalakota, R., & Robinson, M. (2003). Services blueprint: Roadmap for execution. Boston: Addison-Wesley.
Lindsay, D., Downs, K., & Lunn (2003). Business processes – attempts to find a definition. Information and Software Technology, 45, 1015-1019.

Ould, A. M. (1995). Business processes: Modelling and analysis for re-engineering. Wiley.

Rosemann, M. (2001, March). Business process lifecycle management. Queensland University of Technology.

Scheer, A.-W. (2000). ARIS methods. IDS.

Scheer, A.-W., & Habermann, F. (2000). Making ERP a success. Communications of the Association for Computing Machinery, 43(4), 57-61.

Sharp, A., & McDermott, P. (2001). Just what are processes anyway? Workflow modeling: Tools for process improvement and application development (pp. 53-69).


KEY TERMS

ARIS: Architecture of Integrated Information Systems, a modeling and design tool for business processes.

"as-is" Business Process: The current business process.

"to-be" Business Process: The re-engineered business process.

Business Process: A business process is a collection of interrelated work tasks, initiated in response to an event, that achieves a specific result for the customer of the process.

Digitisation: Measures that automate processes.

ERP: Enterprise resource planning system, a software system for enterprise management. It is also referred to as Enterprise Systems (ES).

Functional Areas: Companies that make products to sell have several functional areas of operations. Each functional area comprises a variety of business functions or business activities.

Integration of Processes: The coordination and integration of processes seamlessly within and without the organization.

Mal-Processes: A sequence of actions that a system can perform, interacting with a legal user of the system, resulting in harm for the organization or stakeholder.

Process Lifecycle Management: Activities undertaken for the proper management of processes, such as identification, analysis, improvement, implementation, execution, and monitoring.
Case-Based Recommender Systems

Fabiana Lorenzi
Universidade Luterana do Brasil, Brazil

Francesco Ricci
eCommerce and Tourism Research Laboratory, ITC-irst, Italy
INTRODUCTION

Recommender systems are being used in e-commerce web sites to help customers select products more suitable to their needs. The growth of the Internet and of business-to-consumer e-commerce has brought the need for such a new technology (Schafer, Konstan, & Riedl, 2001).


BACKGROUND

In the past years, a number of research projects have focused on recommender systems. These systems implement various learning strategies to collect and induce user preferences over time and automatically suggest products that fit the learned user model.

The most popular recommendation methodology is collaborative filtering (Resnick, Iacovou, Suchak, Bergstrom, & Riedl, 1994), which aggregates data about customers' preferences (ratings) to recommend new products to the customers. Content-based filtering (Burke, 2000) is another approach that builds a model of user interests, one for each user, by analyzing the specific customer's behavior. In collaborative filtering the recommendation depends on previous customers' information, and a large number of previous user/system interactions are required to build reliable recommendations. In content-based systems only the data of the current user are exploited, and either explicit information about user interests or a record of implicit feedback is required to build a model of user interests. Content-based systems are usually implemented as classifier systems based on machine learning research (Witten & Frank, 2000). In general, both approaches do not exploit specific knowledge of the domain. For instance, if the domain is computer recommendation, the two above approaches, in building the recommendation for a specific customer, will not exploit knowledge about how a computer works and what the function of a computer component is.

Conversely, in a third approach, called knowledge-based, specific domain knowledge is used to reason about what products fit the customer's preferences (Burke, 2000). The most important advantage is that knowledge can be expressed as a detailed user model, a model of the selection process, or a description of the items that will be suggested. Knowledge-based recommenders can exploit the knowledge contained in cases or encoded in a similarity metric. Case-Based Reasoning (CBR) is one of the methodologies used in the knowledge-based approach. CBR is a problem solving methodology that faces a new problem by first retrieving a past, already solved similar case, and then reusing that case for solving the current problem (Aamodt & Plaza, 1994). In a CBR recommender system (CBR-RS) a set of suggested products is retrieved from the case base by searching for cases similar to a case described by the user (Burke, 2000). In the simplest application of CBR to recommendation problem solving, the user is supposed to be looking for some product to purchase. He/she inputs some requirements about the product, and the system searches the case base for similar products (by means of a similarity metric) that match the user requirements. A set of cases is retrieved from the case base, and these cases can be recommended to the user. If the user is not satisfied with the recommendation, he/she can modify the requirements, i.e., build another query, and a new cycle of the recommendation process is started.

In a CBR-RS the effectiveness of the recommendation is based on: the ability to match user preferences with product descriptions; the tools used to explain the match and to enforce the validity of the suggestion; and the functions provided for navigating the information space. CBR can support the recommendation process in a number of ways. In the simplest approach, CBR retrieval is called with a partial case as input, defined by a set of user preferences (attribute-value pairs), and a set of products matching these preferences is returned to the user.
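The following sketch illustrates this simplest form of case retrieval. It is a minimal illustration under our own assumptions: cases and queries are dicts of attribute-value pairs, per-attribute matching is plain equality, and the weights are user-supplied; real CBR-RSs use richer per-attribute distance functions.

    def similarity(query, case, weights):
        """Weighted similarity between a partial query and a case."""
        score = total = 0.0
        for attr, wanted in query.items():
            w = weights.get(attr, 1.0)
            total += w
            score += w if case.get(attr) == wanted else 0.0
        return score / total if total else 0.0

    def retrieve(query, case_base, weights=None, k=3):
        """Return the k cases most similar to the partial query."""
        weights = weights or {}
        ranked = sorted(case_base,
                        key=lambda c: similarity(query, c, weights),
                        reverse=True)
        return ranked[:k]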
MAIN THRUST

CBR systems implement a problem solving cycle very similar to the recommendation process. It starts with a new problem, retrieves similar cases from the case base, shows the user an old solution or adapts it to better
solve the new problem, and finishes by retaining the new case in the case base. Considering the classic CBR cycle (see Aamodt & Plaza, 1994), we specialized this general framework to the specific tasks of product recommendation. In Figure 1 the boxes corresponding to the classical CBR steps (retrieve, reuse, revise, review, and retain) contain references to systems or functionalities (acronyms) that will be described in the next sections.

Figure 1. CBR-RS framework (Lorenzi & Ricci, 2004)

We now provide a general description of the framework by making some references to systems that will be better described in the rest of the paper. The first user-system interaction in the recommendation cycle occurs in the input stage. According to Bergmann, Richter, Schmitt, Stahl, and Vollrath (2001), there are different strategies to interact with the user, depending on the level of customer assistance offered during the input. The most popular strategy is the dialog-based one, where the system offers guidance to the user by asking questions and presenting product alternatives to help the user decide. Several CBR recommender systems ask the user for input requirements to have an idea of what the user is looking for. In the First Case system (McSherry, 2003a), for instance, the user provides the features of a personal computer that he/she is looking for, such as type, price, processor, or speed. ExpertClerk (Shimazu, 2002) asks the user to answer some questions instead of providing requirements, and with the set of answered questions the system creates the query.

In CBR-RSs, the knowledge is stored in the case base. A case is a piece of knowledge related to a particular context, representing an experience that teaches an essential lesson for reaching the goal of the problem-solving activity. Case modeling deals with the problem of determining which information should be represented and which formalism of representation would be suitable. In CBR-RSs a case should represent a real experience of solving a user recommendation problem. In a CBR-RS, our analysis has identified a general internal structure of the case base: CB ⊆ X × U × S × E. This means that a case c = (x, u, s, e) ∈ CB generally consists of four (optional) sub-elements x, u, s, e, which are elements of the spaces X, U, S, E, respectively. Each CBR-RS adopts a particular model for the spaces X, U, S, E. These spaces could be empty, vectors, sets of documents (textual), labeled graphs, etc.

• Content model (X): the content model describes the attributes of the product.
• User profile (U): the user profile models personal user information, such as name, address, and age, or also past information about the user, such as her preferred products.
• Session model (S): the session model is introduced to collect information about the recommendation session (problem solving loop). In DieToRecs, for instance, a case includes a tree-based model of the user interaction with the system, and it is built incrementally during the recommendation session.
• Evaluation model (E): the evaluation model describes the outcome of the recommendation, i.e., whether the suggestion was appropriate or not. This could be a user's a-posteriori evaluation or, as in (Montaner, Lopez, & la Rosa, 2002), the outcome of an evaluation algorithm that guesses the goodness of the recommendation (exploiting the case base of previous recommendations).
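A minimal rendering of this case structure in code is given below; the class and field names are our own, and each component is optional, as noted above.

    from dataclasses import dataclass
    from typing import Any, Optional

    @dataclass
    class Case:
        content: Optional[dict] = None     # x in X: product attributes
        user: Optional[dict] = None        # u in U: user profile
        session: Optional[Any] = None      # s in S: interaction history
        evaluation: Optional[Any] = None   # e in E: outcome of the recommendation

Systems that treat a case simply as a product would populate only the content component, while session-oriented systems such as DieToRecs would populate all four.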
Actually, in CBR-RSs there is large variability in what a case really models and, therefore, in which components are really implemented. There are systems that use only the content model, i.e., they consider a case as a product, and other systems that focus on the perspective of cases as recommendation sessions.

The first step of the recommendation cycle is the retrieval phase. This is typically the main phase of the CBR cycle, and the majority of CBR-RSs can be described as sophisticated retrieval engines. For example, in Compromise-Driven Retrieval (McSherry, 2003b) the system retrieves similar cases from the case base but also groups the cases, putting together those offering the user the same compromise, and presents to the user just a representative case for each group.

After the retrieval, the reuse stage decides if the case solution can be reused in the current problem. In the simplest CBR-RSs, the system reuses the retrieved cases by showing them to the user. In more advanced solutions, such as (Montaner, Lopez, & la Rosa, 2002) or (Ricci et al., 2003), the retrieved cases are not recommended directly but are used to rank candidate products identified with other approaches, e.g., with an interactive query management component (Ricci et al., 2003).

In the next stage the reused case component is adapted to better fit the new problem. Mostly, the adaptation in CBR-RSs is implemented by allowing the user to customize the retrieved set of products. This can also be implemented as a query refinement task. For example, in Comparison-based Retrieval (McGinty & Smyth, 2003) the system asks the user for feedback (positive or negative) about the retrieved product and with this information updates the user query.

The last step of the CBR recommendation cycle is the retain phase (or learning), where the new case is retained in the case base. In DieToRecs (Fesenmaier, Ricci, Schaumlechner, Wober, & Zanella, 2003), for instance, all the user/system recommendation sessions are stored as cases in the case base.

The next subsections describe very briefly some representative CBR-RSs, focusing on their peculiar characteristics; see (Lorenzi & Ricci, 2004) for the complete report.

Interest Confidence Value (ICV)

Montaner, Lopez, and la Rosa (2002) assume that the user's interest in a new product is similar to the user's interest in similar past products. This means that when a new product comes up, the recommender system predicts the user's interest in it based on interest attributes of similar experiences. A case is modeled by objective attributes describing the product (content model) and subjective attributes describing implicit or explicit interests of the user in this product (evaluation model), i.e., c ∈ X × E. In the evaluation model, the authors introduced the drift attribute, which models a decaying importance of the case as time goes by and the case is not used.

The system can recommend in two different ways: prompted or proactive. In prompted mode, the user provides some preferences (weights in the similarity metric) and the system retrieves similar cases. In the proactive recommendation, the system does not have the user preferences, so it estimates the weights using past interactions. In the reuse phase the system extracts the interest values of retrieved cases, and in the revise phase it calculates the interest confidence value of a restaurant to decide whether it should be recommended to the user or not. The adaptation is done by asking the user for the correct evaluation of the product, and after that a new case (the product and the evaluation) is retained in the case base. It is worth noting that in this approach the recommended product is not retrieved from the case base; rather, the retrieved cases are used to estimate the user's interest in this new product. This approach is similar to that used in DieToRecs in the single item recommendation function.

DieToRecs (DTR)

DieToRecs helps the user to plan a leisure travel (Fesenmaier, Ricci, Schaumlechner, Wober, & Zanella, 2003). We present here two different approaches (decision styles) implemented in DieToRecs: the single item recommendation (SIR) and the travel completion (TC). A case represents a user interaction with the system and is built incrementally during the recommendation session. A case comprises all the quoted models: content, user profile, session, and evaluation model.

SIR starts with the user providing some preferences. The system searches the catalog for products that (logically) match these preferences and returns a result set. This is not to be confused with the retrieval set, which contains a set of similar past recommendation sessions. The products in the result set are then ranked with a double similarity function (Ricci et al., 2003) in the revise stage, after a set of relevant recommendation sessions has been retrieved.

In the TC function the cycle starts with the user's preferences too, but the system retrieves from the case base the cases matching the user's preferences. Before recommending the retrieved cases to the user, the system, in the revise stage, updates or replaces the travel products contained in the case, exploiting up-to-date information taken from the catalogues. In the review phase the system allows the user to reconfigure the recommended travel plan: the user can replace, add, or remove items in the recommended travel. When the user accepts the outcome (the final version of the recommendation shown to the user), the system retains this new case in the case base.

Compromise-Driven Retrieval (CDR)

CDR models a case only by the content component (McSherry, 2003b). In CDR, if a given case c1 is more similar to the target query than another case c2, and differs from the target query in a subset of the attributes in which c2 differs from the target query, then c1 is more acceptable than c2.

In the CDR retrieval algorithm the system sorts all the cases in the case base according to their similarity to a given query. In a second stage, it groups together the cases making the same compromise (i.e., failing to match the same user-preferred attribute values) and builds a reference set with just one case for each compromise group. The cases in the reference set are recommended to the user.
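The grouping step can be sketched as follows. This is our own minimal reading of the algorithm just described, reusing the dict-based cases and the retrieve() function sketched earlier; it is not McSherry's implementation.

    def compromises(query, case):
        """The compromise a case asks of the user: the set of query
        attributes whose preferred values the case fails to match."""
        return frozenset(a for a, v in query.items() if case.get(a) != v)

    def reference_set(query, ranked_cases):
        """Keep the most similar case for each distinct compromise group.

        ranked_cases is assumed sorted by decreasing similarity, so the
        first case seen for a group becomes its representative."""
        best = {}
        for case in ranked_cases:
            best.setdefault(compromises(query, case), case)
        return list(best.values())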
The user can also refine (review) the original query, accepting one compromise and adding some preference on a different attribute (not one already specified). The system will then further decompose the set of cases
corresponding to the selected compromise. The revise and retain phases do not appear in this approach.

ExpertClerk (EC)

ExpertClerk is a tool for developing a virtual salesclerk system as a front end of e-commerce web sites (Shimazu, 2002). The system implements a question selection method (a decision tree with information gain). Using navigation-by-asking, the system starts the recommendation session by asking questions to the user. The questions are nodes in a decision tree. A question node subdivides the set of answer nodes, and each one of these represents a different answer to the question posed by the question node. The system concatenates all the answer nodes chosen by the user and from them constitutes the SQL retrieval condition expression.

This query is applied to the case base to retrieve the set of cases that best match the user query. Then the system shows three sample products to the user and explains their characteristics (positive and negative). In the review phase, the system switches to the navigation-by-proposing conversation mode and allows the user to refine the query. After refinement, the system applies the new query to the case base and retrieves new cases. These cases are ranked and shown to the user. The cycle continues until the user finds a preferred product. In this approach the revise and retain phases are not implemented.


FUTURE TRENDS

This paper presented a review of the literature on CBR recommender systems. We have found that it is often unclear how and why a proposed recommendation methodology can be defined as case-based, and we have therefore introduced a general framework that can illustrate similarities and differences of various approaches. Moreover, we have found that the classical CBR problem-solving loop is implemented only partially, and sometimes it is not clear whether a CBR stage (retrieve, reuse, revise, review, retain) is implemented or not. For this reason, the proposed unifying framework makes possible a coherent description of different CBR-RSs. In addition, an extensive usage of this framework can help in describing in which sense a recommender system exploits the classical CBR cycle, and can point out new interesting issues to be investigated in this area: for instance, the possible ways to adapt retrieved cases to improve the recommendation, and how to learn these adapted cases. We believe that with such a common view it will be easier to understand what the research projects in this area have already delivered, how the existing CBR-RSs behave, and which topics could be better exploited in future systems.


CONCLUSION

In the previous sections we have briefly analyzed eight different CBR recommendation approaches. Table 1 shows the main features of these approaches.

Table 1. Comparison of the CBR-RSs

Approach   Retrieval               Reuse       Revise           Review      Retain
ICV        Similarity              IC value    IC computation   Feedback    Default
SIR        Similarity              Selective   Rank             User edit   Default
TC         Similarity              Default     Logical query    User edit   Default
OBR        Similarity + Ordering   Default     None             Tweak       None
CDR        Similarity + Grouping   Default     None             Tweak       None
EC         Similarity              Default     None             Feedback    None

Some observations are in order. The majority of the CBR-RSs stress the importance of the retrieval phase. Some systems perform retrieval in two steps: first, cases are retrieved by similarity, and then the cases are grouped or filtered. The use of pure similarity does not seem to be enough to retrieve a set of cases that satisfy the user. This seems to be true especially in those application domains that require a complex case structure (e.g., travel plans) and therefore require the development of hybrid solutions for case retrieval.

The default reuse phase is used in the majority of the CBR-RSs, i.e., all the retrieved cases are recommended to the user. ICV and SIR have implemented the reuse phase in a different way; in SIR, for instance, the system can retrieve just part of the case. The same systems that implemented non-trivial reuse approaches have also implemented both the revise phase, where the cases are adapted, and the retain phase, where the new case (the adapted case) is stored.

All the CBR-RSs analyzed implement the review phase, allowing the user to refine the query. Normally the system expects some feedback from the user (positive or negative), new requirements, or a product selection.


REFERENCES

Aamodt, A., & Plaza, E. (1994). Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Communications, 7(1), 39-59.

Bergmann, R., Richter, M., Schmitt, S., Stahl, A., & Vollrath, I. (2001). Utility-oriented matching: New research direction for case-based reasoning. 9th German Workshop on Case-Based Reasoning, GWCBR'01 (pp. 14-16), Baden-Baden, Germany.
Burke, R. (2000). Knowledge-based recommender systems. Encyclopedia of Library and Information Science, Vol. 69.

Fesenmaier, D., Ricci, F., Schaumlechner, E., Wober, K., & Zanella, C. (2003). DIETORECS: Travel advisory for multiple decision styles. Information and Communication Technologies in Tourism, 232-241.

Lorenzi, F., & Ricci, F. (2004). A unifying framework for case-based reasoning recommender systems. Technical Report, IRST.

McGinty, L., & Smyth, B. (2002). Comparison-based recommendation. Advances in Case-Based Reasoning, 6th European Conference on Case-Based Reasoning, ECCBR 2002 (pp. 575-589), Aberdeen, Scotland.

McGinty, L., & Smyth, B. (2003). The power of suggestion. 18th International Joint Conference on Artificial Intelligence, IJCAI-03 (pp. 276-290), Acapulco, Mexico.

McSherry, D. (2003a). Increasing dialogue efficiency in case-based reasoning without loss of solution quality. 18th International Joint Conference on Artificial Intelligence, IJCAI-03 (pp. 121-126), Acapulco, Mexico.

McSherry, D. (2003b). Similarity and compromise. 5th International Conference on Case-Based Reasoning, ICCBR 2003 (pp. 291-305), Trondheim, Norway.

Montaner, M., Lopez, B., & la Rosa, J. D. (2002). Improving case representation and case base maintenance in recommender systems. Advances in Case-Based Reasoning, 6th European Conference on Case-Based Reasoning, ECCBR 2002 (pp. 234-248), Aberdeen, Scotland.

Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., & Riedl, J. (1994). GroupLens: An open architecture for collaborative filtering of Netnews. ACM Conference on Computer-Supported Cooperative Work (pp. 175-186).

Ricci, F., Venturini, A., Cavada, D., Mirzadeh, N., Blaas, D., & Nones, M. (2003). Product recommendation with interactive query management and twofold similarity. 5th International Conference on Case-Based Reasoning, ICCBR 2003 (pp. 479-493), Trondheim, Norway.

Schafer, J. B., Konstan, J. A., & Riedl, J. (2001). E-commerce recommendation applications. Data Mining and Knowledge Discovery, 5(1/2), 115-153.

Shimazu, H. (2002). ExpertClerk: A conversational case-based reasoning tool for developing salesclerk agents in e-commerce webshops. Artificial Intelligence Review, 18, 223-244.

Witten, I. H., & Frank, E. (2000). Data mining. Morgan Kaufmann Publishers.


KEY TERMS

Case-Based Reasoning: An Artificial Intelligence approach that solves new problems using the solutions of past cases.

Collaborative Filtering: An approach that collects user ratings on currently proposed products to infer the similarity between users.

Content-Based Filtering: An approach where the user expresses needs and preferences on a set of attributes and the system retrieves the items that match the description.

Conversational Systems: Systems that can communicate with users through a conversational paradigm.

Machine Learning: The study of computer algorithms that improve automatically through experience.

Recommender Systems: Systems that help the user to choose products, taking into account his/her preferences.

Web Site Personalization: Web sites that are personalized for each user, based on knowledge of the user's interests and needs.
Categorization Process and Data Mining

Maria Suzana Marc Amoretti
Federal University of Rio Grande do Sul (UFRGS), Brazil
INTRODUCTION

For some time, the fields of computer science and cognition have diverged. Researchers in these two areas know ever less about each other's work, and their important discoveries have had diminishing influence on each other. In many universities, researchers in these two areas are in different laboratories and programs, and sometimes in different buildings. One might conclude from this lack of contact that computer science and semiotics functions, such as perception, language, memory, representation, and categorization, reflect independent systems. But over the last several decades, the divergence between cognition and computer science has tended to disappear. These areas need to be studied together, and the cognitive science approach can afford this interdisciplinary view. This article refers to the possibility of circulation between the self-organization of concepts and the relevance of each conceptual property in the data-mining process, and especially discusses categorization in terms of prototype theory, based on the notion of the prototype and of basic level categories. Categorization is a basic means of organizing the world around us and offers a simple way to process the mass of stimuli that one perceives every day. The ability to categorize appears early in infancy and has an important role in the acquisition of concepts in a prototypical approach. Prototype structures have cognitive representations as representations of real-world categories.

The senses of the English words "cat" or "table" are involved in a conceptual inclusion in which the extension of the superordinated (animal/furniture) concept includes the extension of the subordinated (Persian cat/dining room table) concept, while the intension of the more general concept is included in the intension of the more specific concept. This study is included in the categorization process. Categorization is a fundamental process of mental representation used daily by any person or any science, and it is also a central problem in semiotics, linguistics, and data mining.

Data mining also has been defined as a cognitive strategy for automatically searching for new information in large datasets or for selecting a document, which is possible with computer science and semiotics tools. Data mining is an analytic process to explore data in order to find interesting pattern motifs and/or variables in a great quantity of data; it depends mostly on the categorization process. Computational techniques from statistics and pattern recognition are used to do this data-mining practice.


BACKGROUND

Semiotics is a theory of signification (representations, symbols, categories) and meaning extraction. It is a strongly multi-disciplinary field of study, and the mathematical tools of semiotics include those used in pattern recognition. Semiotics is also an inclusive methodology that incorporates all aspects of dealing with symbolic systems of signs. Signification joins all the concepts in an elementary structure of signification. This structure is a related net that allows the construction of a stock of formal definitions such as semantic category. Hjelmslev considers the category as a paradigm, where elements can be introduced only in some positions.

The text categorization process, also known as text classification or topic spotting, is the task of automatically classifying a set of documents into categories from a predefined set. This task has several applications, including automated indexing of scientific articles according to predefined thesauri of technical terms, filing patents into patent directories, and selective dissemination of information to information users. There are many new categorization methods to realize the categorization task, including, among others:

1. The language-model-based classification and the maximum entropy classification, a probability distribution estimation technique used for a variety of natural language tasks, such as language modeling, part-of-speech tagging, and text segmentation (the theory underlying maximum entropy is that, without external knowledge, one should prefer distributions that are uniform).
2. The Naïve Bayes classification.
3. The Nearest Neighbor approach, which clusters words into groups based on the distribution of class labels associated with each word.
4. Distributional clustering of words for document classification.
5. Latent Semantic Indexing (LSI), in which we are able to compress the feature space more aggressively while still maintaining high document classification accuracy (this information retrieval method improves the user's ability to find relevant information); the text categorization method based on a combination of distributional features with a Support
Vector Machine (SVM) classifier; and the feature selection approach that uses distributional clustering of words via the recently introduced information bottleneck method, which generates a more efficient representation of the documents.
6. The taxonomy method, based on hierarchical text categorization, in which documents are assigned to leaf-level categories of a category tree (taxonomy is a recently emerged subfield of semantic networks and conceptual maps). Whereas previous work in hierarchical classification focused on document categories, the tree method classifies internal categories with a top-down, level-based classification that can classify concepts in the document.
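As a concrete illustration of one of these methods, the sketch below implements a multinomial Naïve Bayes text classifier with add-one smoothing. It is a minimal, generic rendering of the technique, not any specific system alluded to above; documents are represented as token lists.

    from collections import Counter, defaultdict
    from math import log

    def train(docs):
        """docs: list of (tokens, label) pairs."""
        priors = Counter(label for _, label in docs)
        counts = defaultdict(Counter)          # per-class word counts
        for tokens, label in docs:
            counts[label].update(tokens)
        vocab = {w for c in counts.values() for w in c}
        return priors, counts, vocab

    def predict(tokens, priors, counts, vocab):
        n_docs = sum(priors.values())
        def log_posterior(label):
            total = sum(counts[label].values())
            s = log(priors[label] / n_docs)
            for w in tokens:
                # Add-one (Laplace) smoothing handles unseen words.
                s += log((counts[label][w] + 1) / (total + len(vocab)))
            return s
        return max(priors, key=log_posterior)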
The networks of semantic values thus created and stabilized constitute the cultural-metaphorical worlds which are discursively real for the speakers of particular languages. The elements of these networks, though ultimately rooted in the physical-biological realm, can and do operate independently of the latter, and form the stuff of our everyday discourses (Manjali, 1997, p. 1).

The prototype has given way to a true revolution (the Roschian revolution) regarding classic lexical semantics. If we observe the conceptual map for "chair", for instance, we will realize that the choice of the most representative chair types, that is, our prototype of chair, supposes a double adequacy: referential, because the sign (the concept of chair) must integrate the features retained from the real or imaginary world; and structural, because the sign must be pertinent (an ideological criterion) and distinctive with respect to the other neighboring concepts of chair. When I say that this object is a chair, it is supposed that I have an idea of the chair sign, formed by the use of a lexical or visual image competence coming from my referential experience, and that my prototypical concept of chair is more adequate than its neighbors "bench" or "couch", because I perceive that there is a back part and there are no arms. It is therefore useless to try to explain the creation of a prototype inside a language alone, because it is formed from context interactions. The double origin of a prototype is bound, then, to the shared knowledge relation between subjects and their communities (Amoretti, 2003).


MAIN THRUST

Hypertext poses new challenges for the data-mining process, especially for text categorization research, because metadata extracted from Web sites provide rich information for classifying hypertext documents, and it is a new kind of problem to solve: knowing how to appropriately represent that information and automatically learn statistical patterns for hypertext categorization. The use of technologies in the categorization process through the making of conceptual maps, especially the possibility of creating a collaborative map made by different users, points out the cultural aspects of concept representation in terms of existing coincidences in the choice of the prototypical element within the same cultural group. Thus, the technologies of information, focused on the study of individual maps, demand revisited discussions on the popular perceptions concerning concepts used daily (folk psychology). The aim is to identify ideological similarity and cognitive deviation, both based on the prototypes and on the levels of categorization developed in the maps, with an emphasis on the cultural and semiotic aspects of the investigated groups. This work attempted to show how the semiotic and linguistic analysis of the categorization process can help in the identification of ideological similarity and cognitive deviations, favoring the involvement of subjects in map production, exploring and valuing the relation between the categorization process and the cultural experience of the subject in the world, both parts of the cognitive process of conceptual map construction.

The concept maps, or the semantic nets, are space graphic representations of the concepts and their relationships. The concept maps represent, simultaneously, the organization process of the knowledge, by the relationships (links), and the final product, through the concepts (nodes). This way, besides the relationship between linguistic and visual factors, is the interaction among their objects and their codes (Amoretti, 2001, p. 49).

The building of a map involves collaboration, when the subjects/students/users share information without modifying the data, and involves cooperation, when users not only share their knowledge but also may interfere with and modify the information received from the other users, acting in an asynchronous way to build a collective map. Both cooperation and collaboration attest to the autonomy of the ongoing cognitive process, the direction given by the users themselves when trying to adjust their knowledge.
their communities (Amoretti, 2003). When people do a conceptual map, they usually privi-
lege the level where the prototype is. The basic concept
map starts with a general concept at the top of the map and
MAIN THRUST then works its way down through a hierarchical structure
to more specific concepts. The empirical concept (Kant)
Hypertext poses new challenges for a data-mining pro- of cat and chair has been studied by users with map
cess, especially for text categorization research, because software. They make an initial map at the beginning of the
metadata extracted from Web sites provide rich informa- semester and another about the same subject at the end
tion for classifying hypertext documents, and it is a new of the semester. I first discussed how cats and chairs
kind of problem to solve, to know how to appropriately appear, what could be called the structure of cat and chair
represent that information and automatically learn statis- appearance. Second, I discussed how cat and chair are
tical patterns for hypertext categorization. The use of perceived and which attributes make a cat a cat and a chair
Finally, I will consider cat and chair as an experiential category, so the point of departure is our experience in the world about cat and chair. The acquisition of the concepts cat and chair is mediated by concrete experiences. Thus, the learner must possess relevant prior knowledge and a mental scheme to acquire a prototypical concept.

Expertise changes the competences of conceptual-level organization. In the first maps, the novice chair map privileged the basic level, the most important exemplar of a class, the chair prototype. This level has high coherence and distinctiveness. After thinking about this concept, students (now chair experts) repeated the experiment and carried out the expert chair map with many more details in the superordinate level, showing eight different kinds of chairs: dining room chair, kitchen chair, garden chair, and so forth. This level has high coherence and low distinctiveness (Rosch, 2000). So, users learn by doing the categorization process.

The language system arbitrarily cuts up the concepts into discrete categories (Hjelmeslev, 1968), and all categories have equal status. Human language is both natural and cultural. According to the prototype theory, the role played by non-linguistic factors like perception and environment is demonstrated throughout the concept as the prototype from the subjects of each community.

A concept is a sort of scheme. An effective way of representing a concept is to retain only its most important properties. This group of most important properties of a concept is called the prototype. The idea of the prototype makes it possible for the subject to have a mental construction, identifying the typical features of several categories, and, when the subject finds a new object, he or she may compare it to the prototype in his or her memory. Thus, the prototype of chair, for instance, allows new objects to be identified and labeled as chairs. In individual conceptual map creation, one may confirm the presence of variables for the same concept.

The notion of prototype originated in the 1970s, greatly due to Eleanor Rosch's (2000) psychological research on the organization of conceptual categories. Its revolutionary character marked a new era for the discussions on categorization and brought into question existing theories, such as the classical view. On the basis of Rosch's results, it is argued that members of the so-called Aristotelian (or classical) categories share all the same properties, and it was shown that categories are structured in an entirely different way: the members that constitute them are assigned in terms of gradual participation, and the categorical attribution is made by human beings according to the greater or lesser centrality/marginality of their collocation within the categorical structure. Elements recognized as central members of the category represent the prototype. For instance, a chair is a very good example of the category furniture, while a television is a less typical example of the same category. A chair is a more central member than a television, which, in turn, is a rather marginal member. Rosch (2000) claims that prototypes can only constrain, but do not determine, models of representations.

The main thrust of my argument is that it is very important for data mining to know, besides the cognitive categorization process, what the prototypical concept in a dataset is. This basic level of concept organization reflects the social representation in a better way than the other levels (i.e., the superordinate and subordinate levels). The prototype knowledge affords a variety of culture representations. The conceptual mapping system contains prototype data on the hierarchical way of concept levels. Using different software (three measuring levels: superordinate, basic, and subordinate), I suggest the construction of different maps for each concept to analyze the categorization cognitive process with maps and to show how the categorization performance of individuals and collective or organizational teams over time is important in data-mining work.

Categorization is a part of Jakobson's (2000) communication model (also appropriated from information theory) with cultural aspects (context). This principle allows locating on a definite gradient the objects and relations that are observed, based on similarity and contiguity associations (frame/script semantics) (Schank, 1999) and on hierarchical relations (prototype semantics) (Kleiber, 1990; Rosch, 2000), in terms of perceived family resemblance among category members.
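To make this structure concrete, the following toy sketch (in Python; all names and attributes are illustrative, not taken from the studies cited) shows one way a conceptual map's nodes, links, levels, and prototype marker might be represented for mining:

    # A toy concept map for "chair": nodes carry a categorization level,
    # links carry a labeled relationship, and the basic level holds the prototype.
    concept_map = {
        "nodes": {
            "furniture":     {"level": "superordinate"},
            "chair":         {"level": "basic", "prototype": True},
            "kitchen chair": {"level": "subordinate"},
            "garden chair":  {"level": "subordinate"},
        },
        "links": [
            ("chair", "is-a", "furniture"),
            ("kitchen chair", "is-a", "chair"),
            ("garden chair", "is-a", "chair"),
        ],
    }

    # The prototypical (basic-level) concepts serve as cognitive reference points.
    prototypes = [name for name, attrs in concept_map["nodes"].items()
                  if attrs.get("prototype")]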
FUTURE TRENDS

Data-mining language technology systems typically have focused on the factual aspect of content analysis. However, there are other categorization aspects, including pragmatics, point of view, and style, which must receive more attention, like types and models of subjective classification information and categorization characteristics such as centrality, polarity, intensity, and different levels of granularity (i.e., expression, clause, sentence, discourse segment, document, hypertext).

It is also important to define the inheritance of properties among different category levels, viewed through hierarchical relations, as one that allows virtually adding certain pairs of values (attributes from one unit to another). We should also think of managing concepts that, in a given category, are considered as exceptions. It would be necessary to allow the blockage of the inheritance of certain attributes. I will be opening new perspectives for the data-mining research of categorization and prototype study, which shows the ideological similarity perception mediated by collaborative conceptual maps.

Much is still unknowable about the future of data mining in higher education and in the business intelligence process. The categorization process is a factor that will affect this future and can be identified with the crucial role played by the prototypes. Linguistics has not yet paid this principle due attention. However, some consequences should already necessarily follow from its prototype recognition. The extremely powerful explanation of prototype categorization constitutes the most salient feature in data mining. So, a very important application in the data-mining methodology is the results of the prototype categorization research as a form of retrieval of unexpected information.

CONCLUSION

Categorization explains aspects of people's cultural logic. In this chapter, statistical pattern recognition approaches are used to classify concepts which are present in a given dataset and expressed by conceptual maps. Prototypes have as their basis the representation of concepts through inheritance by default (héritage par défaut), which allows a great economy in the acquisition and managing of information. The objective of this study was to investigate the potential of founding the categories in the text with conceptual maps, and to use these maps as data-mining tools. Based on the user's cognitive characteristics of knowledge organization and on the prevalence of the basic level (the prototypical level), a case is made for the necessity of the categorization cognitive process as a step of data mining.

REFERENCES

Amoretti, M.S.M. (2001). Protótipos e estereótipos: Aprendizagem de conceitos: Mapas conceituais: Uma experiência em Educação a Distância [Prototypes and stereotypes: Concept learning: Conceptual maps: An experience in distance education]. Revista Informática na Educação: Teoria e Prática, Porto Alegre, Brazil.

Amoretti, M.S.M. (2003). Conceptual maps: A metacognitive strategy to learn concepts. Proceedings of the 25th Annual Meeting of the Cognitive Science Society, Boston, Massachusetts.

Amoretti, M.S.M. (2004a). Categorization process and conceptual maps. Proceedings of the First International Conference on Concept Mapping, Pamplona, Spain.

Amoretti, M.S.M. (2004b). Collaborative learning concepts in distance learning: Conceptual map: Analysis of prototypes and categorization levels. Proceedings of the CCM Digital Government Symposium, Tuscaloosa, Alabama.

Andler, D. (1987). Introduction aux sciences cognitives [Introduction to the cognitive sciences]. Paris: Gallimard.

Cordier, F. (1989). Les notions de typicalité et niveau d'abstraction: Analyse des propriétés des représentations [The notions of typicality and level of abstraction: An analysis of the properties of representations] (Doctoral dissertation). Paris: Sud University.

Frawley, W., Piatetsky-Shapiro, G., & Matheus, C. (1992). Knowledge discovery in databases: An overview. AI Magazine, 13(2), 57-70.

Greimas, A.J. (1966). Sémantique structurale: Recherche de méthode [Structural semantics: An attempt at a method]. Paris: PUF.

Hand, D., Mannila, H., & Smyth, P. (2001). Principles of data mining. Boston: MIT Press.

Hjelmeslev, L. (1968). Prolégomènes à une théorie du langage [Prolegomena to a theory of language]. Paris: Éditions de Minuit.

Jakobson, R. (2000). Linguistics and poetics. London: Lodge and Wood.

Kleiber, G. (1990). La sémantique du prototype: Catégories et sens lexical [Prototype semantics: Categories and lexical meaning]. Paris: PUF.

Manjali, F.D. (1997). Dynamical models in semiotics/semantics. Retrieved from http://www.chass.utoronto.ca/epc/srb/cyber/manout.html

Minsky, M.L. (1977). A framework for representing knowledge. In P.H. Winston (Ed.), The psychology of computer vision (pp. 211-277). New York: McGraw-Hill.

Rosch, E. et al. (2000). The embodied mind. London: MIT Press.

Schank, R. (1999). Dynamic memory revisited. Cambridge: Cambridge University Press.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1-47.

Yang, Y. (1999). An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1-2), 69-90.

KEY TERMS

Categorization: A cognitive process, based on similarity of mental schemes and concepts, by which subjects establish conditions that are both necessary and sufficient (properties) to capture meaning and/or the hierarchy of inclusion (as part of a set) through family resemblances shared by their members. Every category has a prototypical internal structure, depending on the context.
Concept: A sort of scheme produced by repeated experiences. Concepts are essentially each little idea that we have in our heads about anything. This includes not only everything, but every attribute of everything.

Conceptual Maps: Semiotic representations (linguistic and visual) of concepts (nodes) and their relationships (links); they represent the organization process of knowledge. When people do a conceptual map, they usually privilege the level where the prototype is. They prefer to categorize at an intermediate level; this basic level is the first level learned, the most common level named, and the most general level where visual shape and attributes are maintained.

Prototype: An effective way of representing a concept is to retain only its most important properties or the most typical element of a category, which serves as a cognitive reference point with respect to a cultural community. This group of most important properties or most typical elements of a concept is called the prototype. The idea of the prototype makes it possible for the subject to have a mental construction, identifying the typical features of several categories. The prototype is defined as the object that is a category's best model.
Center-Based Clustering and Regression Clustering

Bin Zhang
Hewlett-Packard Research Laboratories, USA

INTRODUCTION

Center-based clustering algorithms are generalized to more complex model-based, especially regression-model-based, clustering algorithms. This article briefly reviews three center-based clustering algorithms (K-Means, EM, and K-Harmonic Means) and their generalizations to regression clustering algorithms. More details can be found in the referenced publications.

BACKGROUND

Center-based clustering is a family of techniques with applications in data mining, statistical data analysis (Kaufman et al., 1990), data compression (vector quantization) (Gersho & Gray, 1992), and many others. K-Means (KM) (MacQueen, 1967; Selim & Ismail, 1984) and Expectation Maximization (EM) (Dempster et al., 1977; McLachlan & Krishnan, 1997; Rendner & Walker, 1984) with linear mixing of Gaussian density functions are two of the most popular clustering algorithms.

K-Means is the simplest among the three. It starts with initializing a set of centers M = {m_k | k = 1, …, K} and iteratively refines the location of these centers to find the clusters in a dataset. Here are the steps:

K-Means Algorithm

Step 1: Initialize all centers (randomly or based on any heuristic).
Step 2: Associate each data point with the nearest center. This step partitions the data set into K disjoint subsets (a Voronoi partition).
Step 3: Calculate the best center locations (i.e., the centroids of the partitions) to optimize the performance function (2), which is the total squared distance from each data point to the nearest center.
Step 4: Repeat Steps 2 and 3 until there are no more changes in the membership of the data points (proven to converge).
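As a minimal illustration of these four steps, the following NumPy sketch (our own; the referenced publications do not provide this code) implements K-Means on a numerical dataset X:

    import numpy as np

    def kmeans(X, K, max_iter=100, seed=0):
        # Step 1: initialize the centers with K randomly chosen data points.
        rng = np.random.default_rng(seed)
        M = X[rng.choice(len(X), size=K, replace=False)].copy()
        labels = np.full(len(X), -1)
        for it in range(max_iter):
            # Step 2: associate each point with its nearest center
            # (a Voronoi partition of the data set).
            d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
            new_labels = d2.argmin(axis=1)
            # Step 4: stop when no data point changes its membership.
            if (new_labels == labels).all():
                break
            labels = new_labels
            # Step 3: move each center to the centroid of its partition.
            for k in range(K):
                if (labels == k).any():
                    M[k] = X[labels == k].mean(axis=0)
        return M, labels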
With a guarantee of convergence to only a local optimum, the quality of the converged results, measured by the performance function of the algorithm, could be far from the global optimum. Several researchers explored alternative initializations to achieve convergence to a better local optimum (Bradley & Fayyad, 1998; Meila & Heckerman, 1998; Pena et al., 1999).

K-Harmonic Means (KHM) (Zhang, 2001; Zhang et al., 2000) is a recent addition to the family of center-based clustering algorithms. KHM takes a very different approach from improving the initializations. It tries to address directly the source of the problem: a single cluster is capable of trapping far more centers than its fair share. This is the main reason for the existence of a very large number of local optima under K-Means and EM when K > 10. With the introduction of a dynamic weighting function of the data, KHM is much less sensitive to initialization, as demonstrated through a large number of experiments in Zhang (2003). The dynamic weighting function reduces the ability of a single data cluster to trap many centers.

Replacing the point-centers by more complex data model centers, especially regression models, in the second part of this article, a family of model-based clustering algorithms is created. Regression clustering has been studied under a number of different names: Clusterwise Linear Regression by Spath (1979, 1981, 1983, 1985), DeSarbo and Cron (1988), Hennig (1999, 2000), and others; Trajectory Clustering Using Mixtures of Regression Models by Gaffney and Smith (1999); Fitting Regression Model to Finite Mixtures by Williams (2000); Clustering Using Regression by Gawrysiak et al. (2000); and Clustered Partial Linear Regression by Torgo et al. (2000). Regression clustering is a better name for the family, because it is not limited to linear or piecewise regressions.

Spath (1979, 1981, 1982) used linear regression and a partition of the dataset, similar to K-Means, in his algorithm that locally minimizes the total mean square error over all K regressions. He also developed an incremental version of his algorithm. He visualized his piecewise linear regression concept in his book (Spath, 1985), exactly as he named his algorithm.
DeSarbo (1988) used a maximum likelihood method for performing clusterwise linear regression. Hennig (1999) studied clustered linear regression, as he named it, using the same linear mixing of Gaussian density functions.

MAIN THRUST

For K-Means, EM, and K-Harmonic Means, both their performance functions and their iterative algorithms are treated uniformly in this section for comparison. This uniform treatment is carried over to the three regression clustering algorithms, RC-KM, RC-EM, and RC-KHM, in the second part.

Performance Functions of Center-Based Clustering

Among many clustering algorithms, center-based clustering algorithms stand out in two important aspects: a clearly defined objective function that the algorithm minimizes, compared with agglomerative clustering algorithms that do not have a predefined objective; and a low runtime cost, compared with many other types of clustering algorithms. The time complexity per iteration for all three algorithms is linear in the size of the dataset N, the number of clusters K, and the dimensionality of data D. The number of iterations it takes to converge is very insensitive to N.

Let X = {x_i | i = 1, …, N} be a dataset with K clusters, iid sampled from a hidden distribution, and let M = {m_k | k = 1, …, K} be a set of K centers. K-Means, EM, and K-Harmonic Means find the clusters (the locally optimal locations of the centers) by minimizing a function of the following form over the K centers,

    Perf(X, M) = Σ_{x∈X} d(x, M)    (1)

where d(x, M) measures the distance from a data point to the set of centers. Each algorithm uses a different distance function:

(a) K-Means: d(x, M) = MIN_{1≤k≤K} ||x − m_k||², which makes (1) the same as the more popular form

    Perf_KM(X, M) = Σ_{k=1}^{K} Σ_{x∈S_k} ||x − m_k||²,    (2)

where S_k ⊆ X is the subset of the x that are closer to m_k than to all other centers (the Voronoi partition).

(b) EM: d(x, M) = −log( Σ_{k=1}^{K} p_k · (1/(√(2π))^D) · EXP(−||x − m_k||²) ), where {p_k}_{k=1}^{K} is a set of mixing probabilities. A linear mixture of K identical spherical Gaussian density functions, which is still a probability density function, is used here.

(c) K-Harmonic Means: d(x, M) = HA_{1≤k≤K}(||x − m_k||^p), the harmonic average of the K distances,

    Perf_KHM(X, M) = Σ_{x∈X} [ K / Σ_{m∈M} (1/||x − m||^p) ],  where p > 2.    (3)

The K-Means and K-Harmonic Means performance functions also can be written similarly to the EM one, except that a positive function takes the place of this probability function (Zhang, 2001).
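The three distance functions can be compared directly. The sketch below (ours; unit-variance spherical Gaussians and uniform mixing probabilities are assumed for the EM case) evaluates the performance functions (1)-(3) for a given set of centers:

    import numpy as np

    def perf_km(X, M):
        # (2): each point contributes its squared distance to the nearest center.
        d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
        return d2.min(axis=1).sum()

    def perf_em(X, M, mix=None):
        # EM performance: negative log-likelihood under a linear mixture of
        # K identical spherical Gaussians (unit variance, uniform mixing here).
        N, D = X.shape
        K = len(M)
        mix = np.full(K, 1.0 / K) if mix is None else mix
        d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
        dens = np.exp(-d2 / 2) / np.sqrt(2 * np.pi) ** D
        return -np.log((dens * mix).sum(axis=1)).sum()

    def perf_khm(X, M, p=3.0):
        # (3): sum over points of K / sum_k 1/||x - m_k||^p (with p > 2).
        d = np.sqrt(((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2))
        d = np.maximum(d, 1e-12)          # guard against zero distances
        return (len(M) / (d ** -p).sum(axis=1)).sum()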
Center-Based Clustering Algorithms

The K-Means algorithm is shown in the Introduction. We list the EM and K-Harmonic Means algorithms here to show their similarity.

EM (with Linear Mixing of Spherical Gaussian Densities) Algorithm

Step 1: Initialize the centers and the mixing probabilities {p_k}_{k=1}^{K}.
Step 2: Calculate the expected membership probabilities (see item <B> below).
Step 3: Maximize the likelihood given the current membership by finding the best centers.
Step 4: Repeat Steps 2 and 3 until a chosen convergence criterion is satisfied.
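A single EM iteration under this linear-mixing model can be sketched as follows (our code; the membership computation corresponds to (5) and the mixing update to (6) in item <B> below):

    import numpy as np

    def em_step(X, M, mix):
        # One EM iteration for a linear mixture of K identical spherical Gaussians.
        N, D = X.shape
        d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
        dens = np.exp(-d2 / 2) / np.sqrt(2 * np.pi) ** D      # p(x | m_k)
        # Step 2: expected membership probabilities p(m_k | x).
        resp = dens * mix
        resp /= resp.sum(axis=1, keepdims=True)
        # Updated mixing probabilities; Step 3: likelihood-maximizing centers.
        new_mix = resp.mean(axis=0)
        new_M = (resp.T @ X) / resp.sum(axis=0)[:, None]
        return new_M, new_mix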
K-Harmonic Means Algorithm

Step 1: Initialize the centers.
Step 2: Calculate the membership probabilities and the dynamic weighting (see item <C> below).
Step 3: Find the best centers to optimize the performance function.
Step 4: Repeat Steps 2 and 3 until a chosen convergence criterion is satisfied.

All three iterative algorithms also can be written uniformly as

    m_k^{(u)} = Σ_{x∈X} p(m_k^{(u−1)} | x) · a^{(u−1)}(x) · x  /  Σ_{x∈X} p(m_k^{(u−1)} | x) · a^{(u−1)}(x),
    with a^{(u−1)}(x) > 0,  p(m_k^{(u−1)} | x) ≥ 0,  and  Σ_{l=1}^{K} p(m_l^{(u−1)} | x) = 1.    (4)

(We dropped the iteration index u−1 on p() for shorter notation.) The function a^{(u−1)}(x) is a weight on the data point x in the current iteration. It is called a dynamic weighting function, because it changes in each iteration. The functions p(m_k^{(u−1)} | x) are soft-membership functions, or the probability of x being associated with the center m_k^{(u−1)}. For each algorithm, the details on a() and p(,) are:

A. K-Means: a^{(u−1)}(x) = 1 for all x in all iterations, and p(m_k^{(u−1)} | x) = 1 if m_k^{(u−1)} is the closest center to x; otherwise p(m_k^{(u−1)} | x) = 0. Intuitively, each x is 100% associated with the closest center, and there is no weighting on the data points.

B. EM: a^{(u−1)}(x) = 1 for all x in all iterations, and

    p(m_k^{(u−1)} | x) = p(x | m_k^{(u−1)}) · p(m_k^{(u−1)})  /  Σ_{l=1}^{K} p(x | m_l^{(u−1)}) · p(m_l^{(u−1)}),    (5)

    p(m_k^{(u−1)}) = (1/|X|) Σ_{x∈X} p(m_k^{(u−2)} | x),    (6)

and p(x | m_k^{(u−1)}) is the spherical Gaussian density function centered at m_k^{(u−1)}.

C. K-Harmonic Means: a(x) and p(m_k^{(u−1)} | x) are extracted from the KHM center-update formula

    m_k^{(u)} = Σ_{i=1}^{N} [ (1/d_{i,k}^{p+2}) / (Σ_{l=1}^{K} 1/d_{i,l}^{p})² ] · x_i  /  Σ_{i=1}^{N} [ (1/d_{i,k}^{p+2}) / (Σ_{l=1}^{K} 1/d_{i,l}^{p})² ],
    where d_{i,k} = ||x_i − m_k^{(u−1)}||,    (7)

as (Zhang, 2001)

    a(x_i) = ( Σ_{k=1}^{K} 1/d_{i,k}^{p+2} ) / ( Σ_{l=1}^{K} 1/d_{i,l}^{p} )²  and  p(m_k^{(u−1)} | x_i) = (1/d_{i,k}^{p+2}) / Σ_{l=1}^{K} (1/d_{i,l}^{p+2}),  i = 1, …, N.    (8)
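In code, one KHM iteration computing (7), with the weight and membership factors of (8), might look as follows (our sketch):

    import numpy as np

    def khm_step(X, M, p=3.0, eps=1e-12):
        # Distances d_ik = ||x_i - m_k|| for the current centers.
        d = np.sqrt(((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2))
        d = np.maximum(d, eps)                  # avoid division by zero
        inv_pp2 = d ** -(p + 2)                 # 1 / d^(p+2)
        inv_p = d ** -p                         # 1 / d^p
        # Per-point, per-center coefficients of (7); they factor into the
        # dynamic weight a(x_i) and soft membership p(m_k | x_i) of (8).
        q = inv_pp2 / inv_p.sum(axis=1, keepdims=True) ** 2
        a = inv_pp2.sum(axis=1) / inv_p.sum(axis=1) ** 2        # a(x_i)
        member = inv_pp2 / inv_pp2.sum(axis=1, keepdims=True)   # rows sum to 1
        new_M = (q.T @ X) / q.sum(axis=0)[:, None]              # centers of (7)
        return new_M, a, member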

The dynamic weighting function a(x) ≥ 0 approaches zero when x approaches one of the centers. Intuitively, the closer a data point is to a center, the smaller the weight it gets in the next iteration. This weighting reduces the ability of a cluster to trap more than one center. The effect is clearly observed in the visualization of the hundreds of experiments conducted (see the next section). Compared to KHM, both KM and EM have all data points fully participate in all iterations (the weighting function is a constant 1). They do not have a dynamic weighting function as K-Harmonic Means does. EM and KHM both have soft-membership functions, but K-Means has a 0/1 membership function.

Empirical Comparisons of Center-Based Clustering Algorithms

Empirical comparisons of K-Means, EM, and K-Harmonic Means on 1,200 randomly generated data sets can be found in the paper by Zhang (2003). Each data set has 50 clusters, ranging from well-separated to significantly overlapping. The dimensionality of the data ranges from 2 to 8. All three algorithms are run on each dataset, starting from the same initialization of the centers, and the converged results are measured by a common quality measure, the K-Means performance function, for comparison. Sensitivity to initialization is studied by a rerun of all the experiments on different types of initializations. Major conclusions from the empirical study are as follows:

1. For low dimensional datasets, the performance ranking of the three algorithms is KHM > KM > EM (> means better). For low dimensional datasets (up to 8), the difference is significant.
2. KHM's performance has the smallest variation under different datasets and different initializations. EM's performance has the biggest variation. Its results are most sensitive to initializations.
3. Reproducible results become even more important when we apply these algorithms to different datasets that are sampled from the same hidden distribution. The results from KHM better represent the properties of the distribution and are less dependent on a particular sample set. EM's results are more dependent on the sample set.

The details on the setup of the experiments, quantitative comparisons of the results, and the Matlab source code of K-Harmonic Means can be found in the paper.

Generalization to Complex Model-Based Clustering: Regression Clustering

Clustering applies to datasets without response information (unsupervised); regression applies to datasets with response variables chosen. Given a dataset with responses, Z = (X, Y) = {(x_i, y_i) | i = 1, …, N}, a family of functions Φ = {f} (a function class making the optimization problem well defined, such as polynomials of up to a certain degree), and a loss function e(·) ≥ 0, regression solves the following minimization problem (Montgomery et al., 2001):

    f_opt = arg min_{f∈Φ} Σ_{i=1}^{N} e(f(x_i), y_i)    (9)

Commonly, Φ = { Σ_{l=1}^{m} λ_l h(x, a_l) | λ_l ∈ R, a_l ∈ R^n }, a linear expansion of simple parametric functions, such as polynomials of degree up to m, Fourier series of bounded frequency, or neural networks. Usually, e(f(x), y) = ||f(x) − y||^p, with p = 1, 2 most widely used (Friedman, 1999).
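For instance, with squared loss and the class of polynomials of degree up to m, (9) reduces to ordinary linear least squares over the monomial basis (a sketch; the names are ours):

    import numpy as np

    def fit_poly(x, y, degree=2):
        # Solve (9) for a polynomial f by linear least squares on the
        # Vandermonde (monomial) basis; evaluate with np.polyval(coeffs, x).
        V = np.vander(x, degree + 1)
        coeffs, *_ = np.linalg.lstsq(V, y, rcond=None)
        return coeffs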

Figure 1. (a) Left: A single function is regressed on all training data, which is a mixture of three different distributions. (b) Right: Three regression functions, each regressed on a subset found by RC; the residue errors are much smaller. (Figure omitted.)

Regression in (9) is not effective when the dataset contains a mixture of very different response characteristics, as shown in Figure 1a; it is much better to find the partitions in the data and to learn a separate function on each partition, as shown in Figure 1b. This is the idea of Regression Clustering (RC). Regression provides a model for the clusters; clustering partitions the data to best fit the models. The linkage between the two algorithms is a common objective function shared between the regressions and the clustering.

RC algorithms can be viewed as replacing the K geometric-point centers in center-based clustering algorithms by a set of model-based centers, particularly a set of regression functions M = {f_1, …, f_K}, with the same performance function as defined in (1) but the distance from a data point to the set of centers replaced by the following (with e(f(x), y) = ||f(x) − y||²):

a) d((x, y), M) = MIN_{f∈M} e(f(x), y) for RC with K-Means (RC-KM);

b) d((x, y), M) = −log( Σ_{k=1}^{K} p_k · (1/(√(2π))^D) · EXP(−e(f_k(x), y)) ) for RC-EM; and

c) d((x, y), M) = HA_{f∈M}(e(f(x), y)) for RC with K-Harmonic Means (RC-KHM).

The three iterative algorithms (RC-KM, RC-EM, and RC-KHM), minimizing their corresponding performance functions, take the following common form (10); regression with weighting takes the place of the weighted averaging in (4). The regression function centers in the uth iteration are the solution of the minimization

    f_k^{(u)} = arg min_{f∈Φ} Σ_{i=1}^{N} a(z_i) p(Z_k | z_i) ||f(x_i) − y_i||²    (10)

where the weighting a(z_i) and the probability p(Z_k | z_i) of the data point z_i being in cluster Z_k are both calculated from the centers {f_k^{(u−1)}} of the (u−1)th iteration, as follows:

(a) For RC-K-Means, a(z_i) = 1, and p(Z_k | z_i) = 1 if e(f_k^{(u−1)}(x_i), y_i) < e(f_{k'}^{(u−1)}(x_i), y_i) for all k' ≠ k; otherwise p(Z_k | z_i) = 0. Intuitively, RC-K-Means has the following steps:
Step 1: Initialize the regression functions.
Step 2: Associate each data point (x, y) with the regression function that provides the best approximation (arg min_k {e(f_k(x), y) | k = 1, …, K}).
Step 3: Recalculate the regression function on each partition so as to optimize the performance function.
Step 4: Repeat Steps 2 and 3 until no more data points change their membership.

Comparing these steps with the steps of K-Means, the only differences are that point-centers are replaced by regression functions, and the distance from a point to a center is replaced by the residue error of a pair (x, y) approximated by a regression function.
by a regression function. the convergence behavior of clustering algorithms and
finding systemic design methods to develop better per-
(b) For RC-EM, a( zi ) = 1 and forming clustering algorithms require more research.
Some of the work in this direction is appearing. Nock
and Nielsen (2004) took the dynamic weighting idea and
1
pk(u 1) EXP ( e( f k(u 1) ( xi ), yi ))
2
developed a general framework similar to boosting
p ( u ) ( Z k | zi ) = 1 N


K
1
pk(u 1) EXP( e( f k(u 1) ( xi ), yi ))
and pk(u 1) =
N
p( Z
i =1
( u 1)
k | zi ) . theory in supervised learning.
k =1 2 Regression clustering will find many applications in
analyzing real-word data. Single-function regression
The same parallel structure can be observed between has been used very widely for data analysis and forecast-
the center-based EM clustering algorithm and the RC- ing. Data collected in an uncontrolled environment, like
EM algorithm. in stocks, marketing, economy, government census, and
many other real-world situations, are very likely to
(c) For RC-K-Harmonic Means, with contain a mixture of different response characters. Re-
e( f ( x), y ) =|| f ( xi ) yi || , p' gression clustering is a natural extension to the classi-
cal single-function regression.
K K K
a p ( zi ) = dip,l' + 2 d p'
i ,l
p '+ 2
and p ( Z k | zi ) = di ,k d p '+ 2
i ,l
l =1 l =1 l =1 CONCLUSION

where d i ,l =|| f (u 1) ( xi ) yi || . ( p > 2 is used.) Replacing the simple geometric-point centers in cen-
The same parallel structure can be observed between ter-based clustering algorithms by more complex data
the center-based KHM clustering algorithm and the RC- models provides a general scheme for deriving other
KHM algorithm. model-based clustering algorithms. Regression models
Sensitivity to initialization in center-based cluster- are used in this presentation to demonstrate the process.
ing carries over to regression clustering. In addition, a The key step in the generalization is defining the dis-
new form of local optimum is illustrated in Figure 2. tance function from a data point to the set of models
It happens to all three RC algorithms, RC-KM, RC- the regression functions in this special case.
KHM, and RC-EM. Among the three algorithms, EM has a strong foun-
dation in probability theory. It is the convergence to
only a local optimum and the existence of a very large
Figure 2. A new kind of local optimum occurs in number of optima when the number of clusters is more
regression clustering. than a few (>5, for example) that keeps practitioners
from the benefits of its theory. K-Means is the simplest
and its objective function the most intuitive. But it has
the similar problem as the EMs sensitivity to initializa-
tion of the centers. K-Harmonic Means was developed
with close attention to the dynamics of its convergence;
it is much more robust than the other two on low dimen-

Improving the convergence of center-based clustering algorithms on higher dimensional data (dim > 10) still needs more research.

REFERENCES

Bradley, P., & Fayyad, U.M. (1998). Refining initial points for KM clustering (MS Technical Report MSR-TR-98-36).

Dempster, A.P., Laird, N.M., & Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1), 1-38.

DeSarbo, W.S., & Cron, W.L. (1988). A maximum likelihood methodology for clusterwise linear regression. Journal of Classification, 5, 249-282.

Duda, R., & Hart, P. (1972). Pattern classification and scene analysis. John Wiley & Sons.

Friedman, J., Hastie, T., & Tibshirani, R. (1998). Additive logistic regression: A statistical view of boosting (Technical report). Department of Statistics, Stanford University.

Gersho, A., & Gray, R.M. (1992). Vector quantization and signal compression. Kluwer Academic Publishers.

Hamerly, G., & Elkan, C. (2002). Alternatives to the k-means algorithm that find better clusterings. Proceedings of the ACM Conference on Information and Knowledge Management (CIKM).

Hamerly, G., & Elkan, C. (2003). Learning the k in k-means. Proceedings of the Seventeenth Annual Conference on Neural Information Processing Systems.

Hennig, C. (1997). Datenanalyse mit Modellen für Cluster-lineare Regression [Data analysis with models for clusterwise linear regression] (Dissertation). Hamburg, Germany: Institut für Mathematische Stochastik, Universität Hamburg.

Kaufman, L., & Rousseeuw, P.J. (1990). Finding groups in data: An introduction to cluster analysis. John Wiley & Sons.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, California.

McLachlan, G.J., & Krishnan, T. (1997). The EM algorithm and extensions. John Wiley & Sons.

Meila, M., & Heckerman, D. (1998). An experimental comparison of several clustering and initialization methods. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (pp. 386-395). Morgan Kaufmann.

Montgomery, D.C., Peck, E.A., & Vining, G.G. (2001). Introduction to linear regression analysis. John Wiley & Sons.

Nock, R., & Nielsen, F. (2004). An abstract weighting framework for clustering algorithms. Proceedings of the Fourth International SIAM Conference on Data Mining, Orlando, Florida.

Pena, J., Lozano, J., & Larranaga, P. (1999). An empirical comparison of four initialization methods for the K-means algorithm. Pattern Recognition Letters, 20, 1027-1040.

Rendner, R.A., & Walker, H.F. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2).

Schapire, R.E. (1999). Theoretical views of boosting and applications. Proceedings of the Tenth International Conference on Algorithmic Learning Theory.

Selim, S.Z., & Ismail, M.A. (1984). K-means-type algorithms: A generalized convergence theorem and characterization of local optimality. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6(1).

Silverman, B.W. (1998). Density estimation for statistics and data analysis. Chapman & Hall/CRC.

Spath, H. (1981). Correction to Algorithm 39: Clusterwise linear regression. Computing, 26, 275.

Spath, H. (1982). Algorithm 48: A fast algorithm for clusterwise linear regression. Computing, 29, 175-181.

Spath, H. (1985). Cluster dissection and analysis. New York: Wiley.

Tibshirani, R., Walther, G., & Hastie, T. (2000). Estimating the number of clusters in a dataset via the gap statistic. Retrieved from http://www-stat.stanford.edu/~tibs/research.html

Zhang, B. (2001). Generalized K-Harmonic Means: Dynamic weighting of data in unsupervised learning. Proceedings of the First SIAM International Conference on Data Mining (SDM2001), Chicago, Illinois.

Zhang, B. (2003). Comparison of the performance of center-based clustering algorithms. Proceedings of PAKDD-03, Seoul, South Korea.

Zhang, B. (2003a). Regression clustering. Proceedings of the IEEE International Conference on Data Mining, Melbourne, Florida.
Zhang, B., Hsu, M., & Dayal, U. (2000). K-Harmonic Means. Proceedings of the International Workshop on Temporal, Spatial and Spatio-Temporal Data Mining, Lyon, France.

KEY TERMS

Boosting: Assigning and updating weights on data points according to a particular formula in the process of refining classification models.

Center-Based Clustering: Similarity among the data points is defined through a set of centers. The distance from each data point to a center determines the data point's association with that center. The clusters are represented by the centers.

Clustering: Grouping data according to similarity among them. Each clustering algorithm has its own definition of similarity. Such grouping can be hierarchical.

Dynamic Weighting: Reassigning weights on the data points in each iteration of an iterative algorithm.

Model-Based Clustering: A mixture of simpler distributions is used to fit the data, which defines the clusters of the data. EM with linear mixing of Gaussian density functions is the best example, but K-Means and K-Harmonic Means are of the same type. Regression clustering algorithms are also model-based clustering algorithms, with a mixture of more complex distributions as the model.

Regression: A statistical method of learning the relationship between two sets of variables from data. One set is the independent variables, or the predictors, and the other set is the response variables.

Regression Clustering: Combining regression methods with center-based clustering methods. The simple geometric-point centers in the center-based clustering algorithms are replaced by regression models.

Sensitivity to Initialization: Center-based clustering algorithms are iterative algorithms that minimize the value of a performance function. Such algorithms converge to only a local optimum of the performance function. The converged positions of the centers depend on the initial positions of the centers the algorithm starts with.
Classification and Regression Trees

Johannes Gehrke
Cornell University, USA

INTRODUCTION

It is the goal of classification and regression to build a data-mining model that can be used for prediction. To construct such a model, we are given a set of training records, each having several attributes. These attributes can be either numerical (e.g., age or salary) or categorical (e.g., profession or gender). There is one distinguished attribute, the dependent attribute; the other attributes are called predictor attributes. If the dependent attribute is categorical, the problem is a classification problem. If the dependent attribute is numerical, the problem is a regression problem. It is the goal of classification and regression to construct a data-mining model that predicts the (unknown) value for a record where the value of the dependent attribute is unknown. (We call such a record an unlabeled record.) Classification and regression have a wide range of applications, including scientific experiments, medical diagnosis, fraud detection, credit approval, and target marketing (Hand, 1997).

Many classification and regression models have been proposed in the literature; among the more popular models are neural networks, genetic algorithms, Bayesian methods, linear and log-linear models and other statistical methods, decision tables, and tree-structured models, which are the focus of this article (Breiman, Friedman, Olshen & Stone, 1984). Tree-structured models, so-called decision trees, are easy to understand; they are non-parametric and, thus, do not rely on assumptions about the data distribution; and they have fast construction methods even for large training datasets (Lim, Loh & Shih, 2000). Most data-mining suites include tools for classification and regression tree construction (Goebel & Gruenwald, 1999).

BACKGROUND

Let us start by introducing decision trees. For ease of explanation, we are going to focus on binary decision trees. In binary decision trees, each internal node has two children nodes. Each internal node is associated with a predicate, called the splitting predicate, which involves only the predictor attributes. Each leaf node is associated with a unique value for the dependent attribute. A decision tree encodes a data-mining model as follows. For an unlabeled record, we start at the root node. If the record satisfies the predicate associated with the root node, we follow the tree to the left child of the root, and we go to the right child otherwise. We continue this pattern through a unique path from the root of the tree to a leaf node, where we predict the value of the dependent attribute associated with this leaf node. An example decision tree for a classification problem, a classification tree, is shown in Figure 1. Note that a decision tree automatically captures interactions between variables, but it only includes interactions that help in the prediction of the dependent attribute. For example, the rightmost leaf node in the example shown in Figure 1 is associated with the classification rule: If (Age >= 40) and (Gender = male), then YES; a classification rule that involves an interaction between the two predictor attributes age and gender.

Decision trees can be mined automatically from a training database of records where the value of the dependent attribute is known: a decision tree construction algorithm selects which attribute(s) to involve in the splitting predicates, and the algorithm decides also on the shape and depth of the tree (Murthy, 1998).

MAIN THRUST

Let us discuss how decision trees are mined from a training database. A decision tree usually is constructed in two phases. In the first phase, the growth phase, an overly large and deep tree is constructed from the training data. In the second phase, the pruning phase, the final size of the tree is determined with the goal of minimizing the expected misprediction error (Quinlan, 1993).

Figure 1. An example classification tree: the root splits on "Age < 40" (left branch: leaf "No"); its right child splits on "Gender = M", with leaves "No" and "Yes".
There are two problems that make decision tree construction a hard problem. First, construction of the optimal tree for several measures of optimality is an NP-hard problem. Thus, all decision tree construction algorithms grow the tree top-down according to the following greedy heuristic: At the root node, the training database is examined, and a splitting predicate is selected. Then the training database is partitioned according to the splitting predicate, and the same method is applied recursively at each child node. The second problem is that the training database is only a sample from a much larger population of records. The decision tree has to perform well on records drawn from the population, not on the training database. (For the records in the training database, we already know the value of the dependent attribute.)

Three different algorithmic issues need to be addressed during the tree construction phase. The first issue is to devise a split selection algorithm such that the resulting tree models the underlying dependency relationship between the predictor attributes and the dependent attribute well. During split selection, we have to make two decisions. First, we need to decide which attribute we will select as the splitting attribute. Second, given the splitting attribute, we have to decide on the actual splitting predicate. For a numerical attribute X, splitting predicates are usually of the form X ≤ c, where c is a constant. For example, in the tree shown in Figure 1, the splitting predicate of the root node is of this form. For a categorical attribute X, splits are usually of the form X in C, where C […]

[…] databases (Gehrke, Ramakrishnan & Ganti, 2000; Shafer, Agrawal & Mehta, 1996). In most classification and regression scenarios, we also have costs associated with misclassifying a record, or with being far off in our prediction of a numerical dependent value. Existing decision tree algorithms can take costs into account, and they will bias the model toward minimizing the expected misprediction cost instead of the expected misclassification rate, or the expected difference between the predicted and true value of the dependent attribute.

FUTURE TRENDS

Recent developments have expanded the types of models that a decision tree can have in its leaf nodes. So far, we assumed that each leaf node just predicts a constant value for the dependent attribute. Recent work, however, has shown how to construct decision trees with linear models in the leaf nodes (Dobra & Gehrke, 2002). Another recent development in the general area of data mining is the use of ensembles of models, and decision trees are a popular model for use as a base model in ensemble learning (Caruana, Niculescu-Mizil, Crew & Ksikes, 2004). Another recent trend is the construction of data-mining models over high-speed data streams, and there have been adaptations of decision tree construction algorithms to such environments (Domingos & Hulten, 2002). A last […]