

St. Francis Institute of Management and Research
Mount Poinsur, S.V.P Road, Borivali (West) Mumbai-400103


Winter Project (Information technology studies) Report Title:
Prepared for Mumbai University in the partial fulfillment of the requirement for the award of the degree in

SUBMITTED BY
Patel Pramod Rameshchandra
Roll No: 38
Year: 2009-11
Under the Guidance of Prof. Manoj Mathew.


St. Francis Institute of Management and Research Certificate Of Merit

This is to certify that the work entered in this project is the work of an individual, Mr. Patel Pramod Rameshchandra, Roll No: 38, MMS-II, who has worked for Semester IV of the year 2010-2011 in the college.
Date :


[Internal in-charge] [External in-charge] [College Stamp] [Director]

Acknowledgment

I would like to express my sincere gratitude toward the MBA department of the St. Francis Institute of Management and Research for encouraging me in the development of this project. I would like to thank our Director Dr. Narinder Singh Kabo, my internal project guide Prof. Manoj Mathew and our faculty coordinator Prof. Vaishali Kulkarni for all their help and co-operation.

Above this, I would not like to miss this precious opportunity to thank Thomas Mathew, Sinimole, M. F. Kumbar, Subandu K. Maity, Soma L. Joshua, Sherli Biju, Durgesh Tanna, Miss Radhika S. Appaswamy, Miss Payal P. Patel, Miss Bhagyalaxmi Subramaniam, Miss Hiral Shah, Mohini Ozarkar and Steve Halge our librarian, my friends, and my parents for helping, guiding and supporting us in all problems.



Executive Summary

Data mining is a process that uses a variety of data analysis tools to discover knowledge, patterns and relationships in data that may be used to make valid predictions. Classification is a well-established data mining task that has been extensively studied in the fields of statistics, decision theory and machine learning literature. With the popularity of object-oriented database systems in database applications, it is important to study the data mining methods for object-oriented databases.

This study focuses on the design of an object-oriented database, through incorporation of object-oriented programming concepts into existing relational databases. The traditional Database Management Systems (DBMSs) have limitations when handling complex information and user defined data types, which could be addressed by incorporating object-oriented programming concepts into the existing databases. In the design of the database, the object-oriented programming concepts namely inheritance and polymorphism are employed. The design of the object-oriented database is done in such a well equipped manner that the design itself aids in efficient data mining. Our main objective is to reduce the implementation overhead and the memory space required for storage when compared to the traditional databases.

Purpose of the study

The purpose of this study is to find an effective way of data mining using an object-oriented database and to improve CRM using data mining.

Significance of the study

This work will help provide additional information for the database administrator who is engaged in improving the way data is mined from a data warehouse and in handling data mining effectively. This research work is not done with the intention to replace or duplicate work that is already being done; rather, its outcome can help to complement the work of the business analyst.

Objective of the project

The general objective of this project is to investigate and recommend a suitable way for data mining. The data mining solution proposed in this study could help support a data mining process as well as contribute to building a smooth way of data handling within an organization. In this work, data mining implementations of other companies were investigated through the current and previous year's issues of CRM magazine. In order to meet the general objective of this project, the following key activities must be carried out:
• To study and understand the basic concepts of database, data warehouse and data mining.
• To study and understand the object-oriented database.
• To design a simple object-oriented database.
• To do effective data mining in the designed object-oriented database.
• To find an effective, memory-saving way of data mining using an object-oriented database.
• To find an effective way of data mining that leads to success in CRM.
• To build profitable customer relationships with data mining.

Limitation of the project

This project does not focus on the whole database design. It focuses only on three tables: the Customers table, the Suppliers table, and the Employee table. In a real scenario, a database contains many more than three tables.

Need for study

Data mining roots are traced back along three family lines. The longest of these three lines is classical statistics. Without statistics, there would be no data mining, as statistics are the foundation of most technologies on which data mining is built. Classical statistics embrace concepts such as regression analysis, standard distribution, standard deviation, standard variance, discriminant analysis, cluster analysis, and confidence intervals, all of which are used to study data and data relationships. These are the very building blocks with which more advanced statistical analyses are underpinned. Certainly, within the heart of today's data mining tools and techniques, classical statistical analysis plays a significant role.

Data mining's second longest family line is artificial intelligence, or AI. This discipline, which is built upon heuristics as opposed to statistics, attempts to apply human-thought-like processing to statistical problems. Because this approach requires vast computer processing power, it was not practical until the early 1980s, when computers began to offer useful power at reasonable prices. AI found a few applications at the very high end scientific/government markets, but the required supercomputers of the era priced AI out of the reach of virtually everyone else. The notable exceptions were certain AI concepts which were adopted by some high-end commercial products, such as query optimization modules for Relational Database Management Systems (RDBMS).

The third family line of data mining is machine learning, which is more accurately described as the union of statistics and AI. While AI was not a commercial success, its techniques were largely coopted by machine learning. Machine learning, able to take advantage of the ever-improving price/performance ratios offered by computers of the 80s and 90s, found more applications because the entry price was lower than AI. Machine learning could be considered an evolution of AI, because it blends AI heuristics with advanced statistical analysis. Machine learning attempts to let computer programs learn about the data that business analysts study, such that programs make different decisions based on the qualities of the studied data, using statistics for fundamental concepts, and adding more advanced AI heuristics and algorithms to achieve its goals.

Data mining, in many ways, is fundamentally the adaptation of machine learning techniques to business applications. Data mining is best described as the union of historical and recent developments in statistics, AI, and machine learning. These techniques are then used together to study data and find previously-hidden trends or patterns within. Data mining is finding increasing acceptance in science and business areas which need to analyze large amounts of data to discover trends which business analysts could not otherwise find.

Methodology

This is a primary research. In this project I used the exploratory research technique. The exploratory research technique is used because the problem has not been clearly defined. Exploratory research provides insights into and comprehension of an issue or situation. It should draw definitive conclusions only with extreme caution. This research technique is closely related to tracking and is used in qualitative research projects. The secondary data is collected through reviewing magazines and articles. The Internet is used as the source of most of the data relevant to the issues involved in the study.

Analysis

The following are the major activities of this project:

Task I – The Literature/Computer Weekly Magazines/Articles Review
To study the significance of having a good object-oriented database design.
• Review the literature, Computer Monthly newspapers, CRN magazines and articles.
• Review other relevant data mining approaches for object-oriented databases.

Task II – Problem Analysis
This is the first and base stage of the project. At this stage, requirement elicitation is conducted.
• Information and data collected are analyzed.
• Potential problem areas of designing the database are identified.
• Technological, social, and educational elements are identified and examined.
• Alternatives are explored.
• An object-oriented database design criterion is developed.

Task III – Proposed Effective Way of Data Mining Using Object Oriented Database
Propose an effective way of data mining in an object-oriented database. Evaluate an effective way of data mining using an object-oriented database.

Introduction to Database

The Database Management System

A Database Management System is a collection of software tools intended for the purpose of efficient storage and retrieval of data in a computer system. Some of the important concepts involved in the design and implementation of a Database Management System are discussed below.

The Database

“A database is an integrated collection of automated data files related to one another in the support of a common purpose.”

“A database is a collection of information that is organized so that it can easily be accessed, managed, and updated.” In one view, databases can be classified according to types of content: bibliographic, full-text, numeric, and images.

Each file in a database is made up of data elements – numbers, dates, amounts, quantities, names, addresses and other identifiable items of data. The smallest component of data in a computer is the bit, a binary element with the values 0 and 1. Bits are used to build bytes, which are used to build data elements. Data files contain records that are made up of data elements, and a database consists of files. Starting from the highest level, the hierarchy is as follows:
1. Database
2. File
3. Record
4. Data element
5. Character (byte)
6. Bit
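The six-level hierarchy can be illustrated with a small sketch. This is only an illustration under stated assumptions: the file name `customers`, the element names and the sample values are hypothetical and do not come from the project database.

```python
# Hypothetical sketch of the storage hierarchy; all names and values are
# illustrative assumptions, not part of the project's actual database.
database = {                                  # 1. database: a collection of files
    "customers": [                            # 2. file: a list of records
        {"name": "Asha", "city": "Mumbai"},   # 3. record: a set of data elements
        {"name": "Ravi", "city": "Pune"},
    ],
}

record = database["customers"][0]
element = record["name"]                      # 4. data element
raw_bytes = element.encode("ascii")           # 5. characters stored as bytes
bits = [format(b, "08b") for b in raw_bytes]  # 6. each byte written as 8 bits

print(list(raw_bytes))
print(bits)
```

Reading the sketch from top to bottom follows the hierarchy from the database down to the individual bits of one data value.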

The Data Element

A data element is a place in a file used to store an item of information that is uniquely identifiable by its purpose and contents. A data value is the information stored in a data element. The data element has functional relevance to the application being supported by the database.

The Data Element Dictionary

“A data element dictionary is a table of data elements including at least the names, data types and lengths of every data element in the subject database.” The data element dictionary is central to the application of the database management tools. It forms the basic database schema or the meta-data, which is the description of the database. The DBMS constantly refers to this Data Element Dictionary for interpreting the data stored in the database.

The Data Element Types

Relevant to the database management system, there are a variety of data types that are supported. Examples of common data element types supported are numeric, alphanumeric, character strings, date and time.
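A minimal sketch of how a data element dictionary could be consulted when interpreting a record is given below. The element names, types and lengths are hypothetical assumptions chosen for illustration, and `validate_record` is an invented helper, not a function of any real DBMS.

```python
# Hypothetical data element dictionary: name -> (data type, maximum length).
DATA_ELEMENT_DICTIONARY = {
    "customer_id": (int, 6),
    "name": (str, 30),
    "city": (str, 20),
}

def validate_record(record, dictionary=DATA_ELEMENT_DICTIONARY):
    """Return a list of problems found when checking a record against the dictionary."""
    problems = []
    for element, value in record.items():
        if element not in dictionary:
            problems.append(f"unknown data element: {element}")
            continue
        expected_type, max_length = dictionary[element]
        if not isinstance(value, expected_type):
            problems.append(f"{element}: expected {expected_type.__name__}")
        elif len(str(value)) > max_length:
            problems.append(f"{element}: longer than {max_length} characters")
    return problems

good = {"customer_id": 38, "name": "Asha", "city": "Mumbai"}
bad = {"customer_id": "38", "name": "Asha", "phone": "12345"}
print(validate_record(good))
print(validate_record(bad))
```

The point of the sketch is that the description of the data (the meta-data) is kept separately from the data values and is looked up every time a value is interpreted.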

Files

A database contains a set of files related to one another by a common purpose. “A file is a set of records where the records have the same data elements in the same format.” A file is a collection of records. The records are alike in format but each record is unique in content; therefore the records in a file have the same data elements but different data element values. The organization of the file provides functional storage of data, related to the purpose of the system that the database supports. Interfile relationships are based on the functional relationships of their purposes.

Database Schemas

“A schema is the expression of the database in terms of the files it stores, the data elements in each file, the key data elements used for record identification, and the relationships between files.” The translation of a schema into a database management software system usually involves using a language to describe the schema to the database management system.

The Key Data Elements

“The primary key data element in a file is the data element used to uniquely describe and locate a desired record. The key can be a combination of more than one data element.” The definition of the file includes the specification of the data element or elements that are the key to the file. A file key logically points to the record that it indexes.
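As an illustration of a single-element key and a key built from a combination of data elements, the following sketch uses Python's built-in sqlite3 module with an in-memory database. The table and column names are assumptions made for this example only.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (
        customer_id INTEGER PRIMARY KEY,   -- single-element primary key
        name        TEXT NOT NULL
    );
    CREATE TABLE Borrow (
        customer_id INTEGER NOT NULL,
        item_id     INTEGER NOT NULL,
        borrowed_on TEXT NOT NULL,
        PRIMARY KEY (customer_id, item_id, borrowed_on)  -- composite key
    );
""")
conn.execute("INSERT INTO Customers VALUES (1, 'Asha')")
conn.execute("INSERT INTO Borrow VALUES (1, 10, '2011-01-05')")

# A second record with the same key values is rejected by the DBMS.
try:
    conn.execute("INSERT INTO Borrow VALUES (1, 10, '2011-01-05')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# The key locates exactly one record.
row = conn.execute("SELECT name FROM Customers WHERE customer_id = 1").fetchone()
print(duplicate_rejected, row)
```

The rejected duplicate shows the "uniquely describe" part of the definition, and the lookup by `customer_id` shows the "locate a desired record" part.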

An Interfile Relationship

In a database, it is possible to relate one file to another in one of the following four ways:
• One to one
• Many to one
• One to many
• Many to many

In such interfile relationships, the database management system may or may not enforce the data integrity known as referential integrity.

Mapping cardinalities, or cardinality ratios, express the number of entities to which another entity can be associated via a relationship set. Mapping cardinalities are most useful in describing binary relationship sets, although they can contribute to the description of relationship sets that involve more than two entity sets. For a binary relationship set between entity sets A and B, the mapping cardinality is one of the following:
• One to one. An entity in A is associated with at most one entity in B, and an entity in B is associated with at most one entity in A.
• One to many. An entity in A is associated with any number (zero or more) of entities in B. An entity in B, however, can be associated with at most one entity in A.
• Many to one. An entity in A is associated with at most one entity in B. An entity in B, however, can be associated with any number (zero or more) of entities in A.
• Many to many. An entity in A is associated with any number (zero or more) of entities in B, and an entity in B is associated with any number (zero or more) of entities in A.
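The four definitions can be checked mechanically on a concrete set of (A, B) pairs. The sketch below is an assumption-laden illustration: `mapping_cardinality` is a hypothetical helper, and it reports only the tightest cardinality consistent with the pairs observed, which may be stricter than what the schema designer intended to allow.

```python
from collections import Counter

def mapping_cardinality(pairs):
    """Classify a binary relationship set given as (a, b) pairs."""
    pairs = set(pairs)
    a_counts = Counter(a for a, _ in pairs)  # number of B entities per A entity
    b_counts = Counter(b for _, b in pairs)  # number of A entities per B entity
    a_to_many = any(n > 1 for n in a_counts.values())
    b_to_many = any(n > 1 for n in b_counts.values())
    if a_to_many and b_to_many:
        return "many to many"
    if a_to_many:
        return "one to many"
    if b_to_many:
        return "many to one"
    return "one to one"

# Hypothetical example: one customer (A) with two orders (B).
print(mapping_cardinality([("c1", "o1"), ("c1", "o2")]))
```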

The Data Models

The data in a database may be organized in 3 principal models:
• Hierarchical Data Model: The relationships between the files form a hierarchy.
• Network Data Model: This model is similar to the hierarchical model except that a file can have multiple parents.
• Relational Data Model: Here, the files have no parents and no children, and are not structurally linked to one another. Instead, the relationships are explicitly defined by the user and maintained internally by the database.

The Data Definition Language

The format of the database and the format of the tables must be in a format that the computer can translate into the actual physical storage characteristics for the data. The Data Definition Language (DDL) is used for such a specification. {CREATE, ALTER, DROP}

The Data Manipulation Language

The Data Definition Language is used to describe the database to the DBMS; there is a need for a corresponding language for programs to use to communicate with the DBMS. Such a language is called the Data Manipulation Language (DML). The DDL describes the records to the application programs and the DML provides an interface to the DBMS. The first uses the record format and the second uses external function calls. {SELECT, INSERT, UPDATE, DELETE}
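The split between the DDL and DML command sets can be seen in a short sketch using Python's built-in sqlite3 module. The `Suppliers` table, its columns and the sample rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: CREATE and ALTER describe the structure of the data to the DBMS.
conn.execute("CREATE TABLE Suppliers (supplier_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("ALTER TABLE Suppliers ADD COLUMN city TEXT")

# DML: INSERT, UPDATE, DELETE and SELECT work on the stored records.
conn.execute("INSERT INTO Suppliers (supplier_id, name, city) VALUES (1, 'Acme', 'Pune')")
conn.execute("INSERT INTO Suppliers (supplier_id, name, city) VALUES (2, 'Zenith', 'Delhi')")
conn.execute("UPDATE Suppliers SET city = 'Mumbai' WHERE supplier_id = 1")
conn.execute("DELETE FROM Suppliers WHERE supplier_id = 2")
rows = conn.execute("SELECT supplier_id, name, city FROM Suppliers").fetchall()

# DDL again: DROP removes the structure itself, not just the records.
conn.execute("DROP TABLE Suppliers")
tables = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
print(rows, tables)
```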

The Query Language

The Query Language is used primarily for the process of retrieval of data stored in a database. This data is retrieved by issuing query commands to the DBMS, which in turn interprets and appropriately processes them.

Figure 1: The Database System

Introduction to Data Warehouse and Data Mining

The Data Warehouse

“A data warehouse is a central repository for all or significant parts of the data that an enterprise's various business systems collect.” The term was coined by W. H. Inmon. IBM sometimes uses the term "information warehouse."

“A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way customers can understand and use in a business context.” -Barry Devlin

Typically, a data warehouse is housed on an enterprise mainframe server. Data from various online transaction processing (OLTP) applications and other sources is selectively extracted and organized on the data warehouse database for use by analytical applications and user queries. Data warehousing emphasizes the capture of data from diverse sources for useful analysis and access, but does not generally start from the point-of-view of the end user or knowledge worker who may need access to specialized, sometimes local databases. The latter idea is known as the data mart.

Applications of data warehouses include data mining, Web mining, and decision support systems (DSS).

The Data Mining

“Data mining is sorting through data to identify patterns and establish relationships.”

“Looking for the hidden patterns and trends in data that is not immediately apparent from summarizing the data”

Data mining parameters include:
• Association: Looking for patterns where one event is connected to another event
• Sequence or path analysis: Looking for patterns where one event leads to another later event
• Classification: Looking for new patterns (may result in a change in the way the data is organized)
• Clustering: Finding and visually documenting groups of facts not previously known
• Forecasting: Discovering patterns in data that can lead to reasonable predictions about the future (this area of data mining is known as predictive analytics)

Data mining techniques are used in many research areas, including mathematics, cybernetics, genetics and marketing. Web mining, a type of data mining used in customer relationship management (CRM), takes advantage of the huge amount of information gathered by a Web site to look for patterns in user behavior.
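To make the association parameter listed above concrete, the sketch below computes support and confidence, two measures commonly used for association patterns. They are brought in here as an illustration and are not a method taken from this report; the transactions are invented sample data.

```python
# Hypothetical rental baskets; each set is one customer visit.
transactions = [
    {"comedy", "popcorn"},
    {"comedy", "drama"},
    {"comedy", "popcorn", "soda"},
    {"drama"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Share of transactions containing the antecedent that also contain the consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

s = support({"comedy", "popcorn"}, transactions)
c = confidence({"comedy"}, {"popcorn"}, transactions)
print(s, c)
```

A high confidence for "comedy implies popcorn" would be one example of an event that is connected to another event.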

We are in an age often referred to as the “information age”. In this information age, because we believe that information leads to “power and success”, and thanks to sophisticated technologies such as computers, satellites, etc., organizations have been collecting tremendous amounts of information. Initially, with the advent of computers and means for mass digital storage, organizations started collecting and storing all sorts of data, counting on the power of computers to help sort through this amalgam of information. Unfortunately, these massive collections of data stored on disparate structures very rapidly became overwhelming. This initial chaos has led to the creation of structured databases and Database Management Systems (DBMS).

The efficient Database Management Systems have been very important assets for management of a large corpus of data and especially for effective and efficient retrieval of particular information from a large collection whenever needed. The proliferation of Database Management Systems has also contributed to the recent massive gathering of all sorts of information. Today, organizations have far more information than they can handle: from business transactions and scientific data, to satellite pictures, text reports and military intelligence. Information retrieval is simply not enough anymore for decision-making. Confronted with huge collections of data, organizations have now created new needs to help us make better managerial choices. These needs are automatic summarization of data, extraction of the “essence” of information stored, and the discovery of patterns in raw data.

What kind of Information Data Mining is collecting?

Organizations have been collecting a myriad of data, from simple numerical measurements and text documents, to more complex information such as spatial data, multimedia channels, and hypertext documents. Here is a non-exclusive list of a variety of information collected in digital form in databases and in flat files.
• Business Transactions: Every transaction in the business industry is (often) “memorized” for perpetuity. Such transactions are usually time related and can be inter-business deals such as purchases, exchanges, banking, stock, etc., or intra-business operations such as management of in-house wares and assets. Large department stores, for example, thanks to the widespread use of bar codes, store millions of transactions daily, often representing terabytes of data. Storage space is not the major problem, as the price of hard disks is continuously dropping, but the effective use of the data in a reasonable time frame for competitive decision making is definitely the most important problem to solve for businesses that struggle to survive in a highly competitive world.
• Scientific Data: Whether in a Swiss nuclear accelerator laboratory counting particles, in the Canadian forest studying readings from a grizzly bear radio collar, on a South Pole iceberg gathering data about oceanic activity, or in an American university investigating human psychology, our society is amassing colossal amounts of scientific data that need to be analyzed. Unfortunately, organizations can capture and store new data faster than they can analyze the old data already accumulated.

• Medical and Personal Data: From government census to personnel and customer files, very large collections of information are continuously gathered about individuals and groups. Governments, companies and organizations such as hospitals are stockpiling very important quantities of personal data to help them manage human resources, better understand a market, or simply assist clientele. Regardless of the privacy issues this type of data often reveals, this information is collected, used and even shared. When correlated with other data this information can shed light on customer behaviors and the like.
• Surveillance Video and Pictures: With the amazing collapse of video camera prices, video cameras are becoming ubiquitous. Video tapes from surveillance cameras are usually recycled and thus the content is lost. However, there is a tendency today to store the tapes and even digitize them for future use and analysis.
• Satellite sensing: There are a countless number of satellites around the globe: some are geo-stationary above a region, and some are orbiting around the Earth, but all are sending a non-stop stream of data to the surface. NASA, which controls a large number of satellites, receives more data every second than what all NASA researchers and engineers can cope with. Many satellite pictures and data are made public as soon as they are received in the hopes that other researchers can analyze them.
• Games: Our society is collecting a tremendous amount of data and statistics about games, players and athletes. From hockey scores, basketball passes and car-racing lapses, to swimming times, boxer’s pushes and chess positions, all the data are stored. Commentators and journalists are using this information for reporting, but trainers and athletes would want to exploit this data to improve performance and better understand opponents.

• Digital Media: The proliferation of cheap scanners, desktop video cameras and digital cameras is one of the causes of the explosion in digital media repositories. In addition, many radio stations, television channels and film studios are digitizing their audio and video collections to improve the management of their multimedia assets.
• CAD and Software Engineering Data: There are a multitude of “Computer Assisted Design” (CAD) systems for architects to design buildings or engineers to conceive system components or circuits. These systems are generating a tremendous amount of data. Moreover, software engineering is a source of considerable similar data with code, function libraries, objects, etc., which need powerful tools for management and maintenance.
• Virtual Worlds: There are many applications making use of three-dimensional virtual spaces. These spaces and the objects they contain are described with special languages such as VRML. Ideally, these virtual spaces are described in such a way that they can share objects and places. There is a remarkable amount of virtual reality object and space repositories available. Management of these repositories as well as content-based search and retrieval from these repositories are still research issues, while the size of the collections continues to grow.
• Text Reports and Memos (E-mail Messages): Most of the communications within and between companies or research organizations, or even private people, are based on reports and memos in textual forms often exchanged by e-mail. These messages are regularly stored in digital form for future use and reference, creating formidable digital libraries.

• The World Wide Web Repositories: Since the inception of the World Wide Web in 1993, documents of all sorts of formats, content and description have been collected and inter-connected with hyperlinks, making it the largest repository of data ever built. Despite its dynamic and unstructured nature, its heterogeneous characteristic, and its frequent redundancy and inconsistency, the World Wide Web is the most important data collection regularly used for reference because of the broad variety of topics covered and the infinite contributions of resources and publishers. Many believe that the World Wide Web will become the compilation of human knowledge.

What are Data Mining and Knowledge Discovery?

With the enormous amount of data stored in files, databases, and other repositories, it is increasingly important, if not necessary, to develop powerful means for analysis and perhaps interpretation of such data and for the extraction of interesting knowledge that could help in decision-making. Data Mining, also popularly known as “Knowledge Discovery in Databases” (KDD), refers to the nontrivial extraction of implicit, previously unknown and potentially useful information from data in databases. While data mining and knowledge discovery in databases (or KDD) are frequently treated as synonyms, data mining is actually part of the knowledge discovery process.

Figure 2 shows data mining as a step in an iterative knowledge discovery process.

Figure 2: Data Mining is the core of Knowledge Discovery Process

The Knowledge Discovery in Databases process comprises a few steps leading from raw data collections to some form of new knowledge.

The iterative process consists of the following steps:
• Data Cleaning: Also known as data cleansing, it is a phase in which noise data and irrelevant data are removed from the collection.
• Data Integration: At this stage, multiple data sources, often heterogeneous, may be combined in a common source.
• Data Selection: At this step, the data relevant to the analysis is decided on and retrieved from the data collection.
• Data Transformation: Also known as data consolidation, it is a phase in which the selected data is transformed into forms appropriate for the mining procedure.
• Data Mining: It is the crucial step in which clever techniques are applied to extract potentially useful patterns.
• Pattern Evaluation: In this step, strictly interesting patterns representing knowledge are identified based on given measures.
• Knowledge Representation: It is the final phase in which the discovered knowledge is visually represented to the user. This essential step uses visualization techniques to help users understand and interpret the data mining results.

It is common to combine some of these steps together. For instance, data cleaning and data integration can be performed together as a pre-processing phase to generate a data warehouse. Data selection and data transformation can also be combined where the consolidation of the data is the result of the selection, or, as for the case of data warehouses, the selection is done on transformed data.
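The sequence of steps can be sketched as a chain of small functions. Everything here is a simplifying assumption made for illustration: the two store sources, the field names, the use of an average as the "mining" step and the `min_fee` threshold are all hypothetical, and real mining techniques are far richer.

```python
# Hypothetical raw sources; None marks a noisy or missing value.
store_a = [{"customer": "Asha", "category": "comedy", "fee": "3.0"},
           {"customer": "Ravi", "category": None, "fee": "2.5"}]
store_b = [{"customer": "Meena", "category": "comedy", "fee": "3.5"},
           {"customer": "John", "category": "drama", "fee": "2.0"}]

def clean(records):        # data cleaning: drop records with missing values
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):   # data integration: merge sources into one collection
    return [r for source in sources for r in source]

def select(records):       # data selection: keep only the relevant attributes
    return [{"category": r["category"], "fee": r["fee"]} for r in records]

def transform(records):    # data transformation: convert fees to numbers
    return [{"category": r["category"], "fee": float(r["fee"])} for r in records]

def mine(records):         # data mining: here, simply the average fee per category
    fees_by_category = {}
    for r in records:
        fees_by_category.setdefault(r["category"], []).append(r["fee"])
    return {cat: sum(fees) / len(fees) for cat, fees in fees_by_category.items()}

def evaluate(patterns, min_fee=3.0):  # pattern evaluation: keep patterns above a threshold
    return {cat: avg for cat, avg in patterns.items() if avg >= min_fee}

patterns = evaluate(mine(transform(select(integrate(clean(store_a), clean(store_b))))))
print(patterns)            # knowledge representation: present the result to the user
```

Because each step is a separate function, steps can be combined or repeated with different settings, which mirrors the iterative nature of the process.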

The KDD is an iterative process. Once the discovered knowledge is presented to the user, the evaluation measures can be enhanced, the mining can be further refined, new data can be selected or further transformed, or new data sources can be integrated, in order to get different, more appropriate results.

Data mining derives its name from the similarities between searching for valuable information in a large database and mining rocks for a vein of valuable ore. Both imply either sifting through a large amount of material or ingeniously probing the material to exactly pinpoint where the values reside. It is, however, a misnomer, since mining for gold in rocks is usually called “gold mining” and not “rock mining”; thus, by analogy, data mining should have been called “knowledge mining” instead. Nevertheless, data mining became the accepted customary term, and very rapidly a trend that even overshadowed more general terms such as knowledge discovery in databases (KDD) that describe a more complete process. Other similar terms referring to data mining are: data dredging, knowledge extraction and pattern discovery.

What kind of Data can be mined?

In principle, data mining is not specific to one type of media or data. Data mining should be applicable to any kind of information repository. However, algorithms and approaches may differ when applied to different types of data. Indeed, the challenges presented by different types of data vary significantly. Data mining is being put into use and studied for databases, including relational databases, object-relational databases and object-oriented databases, data warehouses, transactional databases, unstructured and semi-structured repositories such as the World Wide Web, advanced databases such as spatial databases, multimedia databases, time-series databases and textual databases, and even flat files. Here are some examples in more detail:
• Flat files: Flat files are actually the most common data source for data mining algorithms, especially at the research level. Flat files are simple data files in text or binary format with a structure known by the data mining algorithm to be applied. The data in these files can be transactions, time-series data, scientific measurements, etc.

35 . where columns represent “Attributes” and rows represent “tuples”. a relational database consists of a set of tables containing either values of entity attributes. A tuple in a relational table corresponds to either an object or a relationship between objects and is identified by a set of attribute values representing a “unique key”. Tables have columns and rows. In Figure 3 I have present some relations Customer. and Borrow representing business activity in a fictitious video store VideoStore. These relations are just a subset of what could be a database for the video store and is given as an example. Figure 3: Fragments of some relations from a relational database for VideoStore pg. Items.• Relational Databases: Briefly. or values of attributes from entity relationships.

The most commonly used query language for relational databases is SQL, which allows retrieval and manipulation of the data stored in the tables, as well as the calculation of aggregate functions such as average, sum, min, max and count. For instance, an SQL query to count the videos grouped by category would be: “SELECT category, COUNT(*) FROM Items WHERE type = 'video' GROUP BY category;” Data mining algorithms using relational databases can be more versatile than data mining algorithms specifically written for flat files, since they can take advantage of the structure inherent to relational databases. While data mining can benefit from SQL for data selection, transformation and consolidation, it goes beyond what SQL could provide, such as predicting, comparing, detecting deviations, etc.
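The grouping query can be tried out with Python's built-in sqlite3 module. The `Items` table layout and the sample rows below are assumptions, since the contents of Figure 3 are not reproduced in this text. The string literal 'video' is quoted and the category column is included in the result so that each count can be matched to its group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical layout of the Items relation; only type and category matter here.
conn.execute("CREATE TABLE Items (item_id INTEGER PRIMARY KEY, type TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO Items (type, category) VALUES (?, ?)",
    [("video", "comedy"), ("video", "comedy"), ("video", "drama"), ("game", "sports")],
)

# Count the videos in each category; the non-video row is filtered out first.
rows = conn.execute(
    "SELECT category, COUNT(*) FROM Items WHERE type = 'video' "
    "GROUP BY category ORDER BY category"
).fetchall()
print(rows)
```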

• Data Warehouses: A data warehouse, as a storehouse, is a repository of data collected from multiple data sources (often heterogeneous) and is intended to be used as a whole under the same unified schema. A data warehouse gives the option to analyze data from different sources under the same roof. Let us suppose that VideoStore becomes a franchise in New York. Many video stores belonging to the VideoStore company may have different databases and different structures. If the executive of the company wants to access the data from all stores for strategic decision-making, future direction, marketing, etc., it would be more appropriate to store all the data in one site with a homogeneous structure that allows interactive analysis. In other words, data from the different stores would be loaded, cleaned, transformed and integrated together. To facilitate decision-making and multi-dimensional views, data warehouses are usually modeled by a multi-dimensional data structure. Figure 4 shows an example of a three-dimensional subset of a data cube structure used for the VideoStore data warehouse.

Figure 4: A multi-dimensional data cube structure commonly used for data warehousing

The figure shows summarized rentals grouped by film categories, then a cross table of summarized rentals by film categories and time (in quarters). The data cube gives the summarized rentals along three dimensions: category, time, and city. A cube contains cells that store values of some aggregate measures (in this case rental counts), and special cells that store summations along dimensions. Each dimension of the data cube contains a hierarchy of values for one attribute.

Because of their structure, the pre-computed summarized data they contain and the hierarchical attribute values of their dimensions, data cubes are well suited for fast interactive querying and analysis of data at different conceptual levels, known as On-Line Analytical Processing (OLAP). OLAP operations allow the navigation of data at different levels of abstraction, such as drill-down, roll-up, slice, dice, etc. Figure 5 illustrates the drill-down (on the time dimension) and roll-up (on the location dimension) operations.

Figure 5: Summarized data from VideoStore before and after drill-down and roll-up operations.
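Roll-up and slice can be sketched on a tiny cube held in a dictionary. Everything in the example is assumed for illustration: the rental counts, the city names and the city-to-region hierarchy are hypothetical, and a real OLAP engine would use pre-computed aggregates rather than recomputing sums.

```python
from collections import defaultdict

# Hypothetical rental counts keyed by (category, quarter, city).
cube = {
    ("Drama", "Q1", "CityA"): 120,
    ("Drama", "Q2", "CityA"): 90,
    ("Drama", "Q1", "CityB"): 60,
    ("Comedy", "Q1", "CityA"): 80,
    ("Comedy", "Q2", "CityB"): 40,
}

# Assumed concept hierarchy for the location dimension: city -> region.
city_to_region = {"CityA": "East", "CityB": "East"}


def roll_up_location(cells, hierarchy):
    """Roll-up: aggregate counts from the city level to the region level."""
    totals = defaultdict(int)
    for (category, quarter, city), count in cells.items():
        totals[(category, quarter, hierarchy[city])] += count
    return dict(totals)


def slice_by_quarter(cells, quarter):
    """Slice: fix the time dimension to a single quarter."""
    return {
        (category, city): count
        for (category, q, city), count in cells.items()
        if q == quarter
    }


by_region = roll_up_location(cube, city_to_region)
first_quarter = slice_by_quarter(cube, "Q1")
print(by_region[("Drama", "Q1", "East")])
print(first_quarter)
```

Drill-down is the reverse navigation: it goes from the region totals back to the city-level cells, which is only possible because the detailed cells are kept in the cube.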

• Transaction Databases: A transaction database is a set of records representing transactions, each with a time stamp, an identifier and a set of items. Associated with the transaction files could also be descriptive data for the items. For example, in the case of the video store, the rentals table such as shown in Figure 6 represents the transaction database. Each record is a rental contract with a customer identifier, a date, and the list of items rented (i.e. video tapes, games, VCR, etc.). Since relational databases do not allow nested tables (i.e. a set as attribute value), transactions are usually stored in flat files or stored in two normalized transaction tables, one for the transactions and one for the transaction items. One typical data mining analysis on such data is the so-called market basket analysis or association rules, in which associations between items occurring together or in sequence are studied.

Figure 6: Fragment of a transaction database for the rentals at VideoStore

• Multimedia Databases: Multimedia databases include video, images, audio and text media. Multimedia databases can be stored on extended object-relational or object-oriented databases, or simply on a file system. Multimedia is characterized by its high dimensionality, which makes data mining even more challenging. Data mining from multimedia repositories may require computer vision, computer graphics, image interpretation, and natural language processing methodologies.

• Spatial Databases: Spatial databases are databases that, in addition to usual data, store geographical information like maps, and global or regional positioning. Such spatial databases present new challenges to data mining algorithms.

Figure 7: Visualization of spatial OLAP (from GeoMiner system)

• Time-Series Databases: Time-series databases contain time-related data such as stock market data or logged activities. These databases usually have a continuous flow of new data coming in, which sometimes causes the need for a challenging real-time analysis. Data mining in such databases commonly includes the study of trends and correlations between evolutions of different variables, as well as the prediction of trends and movements of the variables in time. Figure 8 shows some examples of time-series data.

Figure 8: Examples of Time-Series Data (Source: Thompson Investors Group)

• World Wide Web: The World Wide Web is the most heterogeneous and dynamic repository available. A very large number of authors and publishers are continuously contributing to its growth and metamorphosis, and a massive number of users are accessing its resources daily. Data in the World Wide Web is organized in inter-connected documents. These documents can be text, audio, video, raw data, and even applications. Conceptually, the World Wide Web is comprised of three major components: the “Content of the Web”, which encompasses the documents available; the “Structure of the Web”, which covers the hyperlinks and the relationships between documents; and the “Usage of the Web”, describing how and when the resources are accessed. A fourth dimension can be added relating the dynamic nature or evolution of the documents. Data mining in the World Wide Web, or web mining, tries to address all these issues and is often divided into web content mining, web structure mining and web usage mining.

What can be discovered?

The kinds of patterns that can be discovered depend upon the data mining tasks employed. By and large, there are two types of data mining tasks: descriptive data mining tasks that describe the general properties of the existing data, and predictive data mining tasks that attempt to do predictions based on inference on available data. The data mining functionalities and the variety of knowledge they discover are briefly presented in the following list:

• Characterization: Data characterization is a summarization of general features of objects in a target class, and produces what is called characteristic rules. The data relevant to a user-specified class are normally retrieved by a database query and run through a summarization module to extract the essence of the data at different levels of abstraction. For example, one may want to characterize the VideoStore customers who regularly rent more than 30 movies a year. With concept hierarchies on the attributes describing the target class, the attribute-oriented induction method can be used, for example, to carry out data summarization. Note that with a data cube containing summarization of data, simple OLAP operations fit the purpose of data characterization.

• Discrimination: Data discrimination produces what are called discriminant rules and is basically the comparison of the general features of objects between two classes referred to as the target class and the contrasting class. For example, one may want to compare the general characteristics of the customers who rented more than 30 movies in the last year with those whose rental count is lower than 5. The techniques used for data discrimination are very similar to the techniques used for data characterization with the exception that data discrimination results include comparative measures.

• Association analysis: Association analysis is the discovery of what are commonly called association rules. It studies the frequency of items occurring together in transactional databases, and based on a threshold called support, identifies the frequent item sets. Another threshold, confidence, which is the conditional probability that an item appears in a transaction when another item appears, is used to pinpoint association rules. Association analysis is commonly used for market basket analysis. For example, it could be useful for the VideoStore manager to know what movies are often rented together or if there is a relationship between renting a certain type of movie and buying popcorn or pop.

The discovered association rules are of the form P → Q [s, c], where P and Q are conjunctions of attribute-value pairs, s (for support) is the probability that P and Q appear together in a transaction, and c (for confidence) is the conditional probability that Q appears in a transaction when P is present. For example, the hypothetical association rule

RentType(X, “game”) ∧ Age(X, “13-19”) → Buys(X, “pop”) [s=2%, c=55%]

would indicate that 2% of the transactions considered are of customers aged between 13 and 19 who are renting a game and buying a pop, and that there is a certainty of 55% that teenage customers who rent a game also buy pop.
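Support and confidence as defined above can be computed directly from a list of transactions. The following is a minimal sketch, assuming a handful of hypothetical baskets; the function names support and confidence are introduced here for illustration and do not come from any particular library.

```python
# Hypothetical transactions, each a set of items rented or bought together.
transactions = [
    {"game", "pop"},
    {"game", "pop", "popcorn"},
    {"game"},
    {"movie", "popcorn"},
    {"movie", "pop"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate P(consequent | antecedent) from the baskets."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


s = support({"game", "pop"}, transactions)
c = confidence({"game"}, {"pop"}, transactions)
print(f"game -> pop: support={s:.2f}, confidence={c:.2f}")
```

With these five baskets, two contain both a game and pop (support 2/5) and three contain a game, so the confidence of game → pop is 2/3.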

• Classification: Classification analysis is the organization of data in given classes. Also known as supervised classification, the classification uses given class labels to order the objects in the data collection. Classification approaches normally use a training set where all objects are already associated with known class labels. The classification algorithm learns from the training set and builds a model. The model is used to classify new objects. For example, after starting a credit policy, the VideoStore managers could analyze the customers’ behaviors vis-à-vis their credit, and label accordingly the customers who received credits with three possible labels: “safe”, “risky” and “very risky”. The classification analysis would generate a model that could be used to either accept or reject credit requests in the future.
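The train-then-classify cycle can be illustrated with a one-nearest-neighbour classifier, chosen here only because it is short; the report does not say which algorithm VideoStore would use. The two features (late returns per year, unpaid balance) and all training rows are hypothetical.

```python
import math

# Hypothetical training set: (late_returns_per_year, unpaid_balance) -> label.
training_set = [
    ((0, 0.0), "safe"),
    ((1, 5.0), "safe"),
    ((4, 20.0), "risky"),
    ((5, 25.0), "risky"),
    ((9, 80.0), "very risky"),
    ((10, 95.0), "very risky"),
]


def classify(customer, examples):
    """Return the label of the training example closest to customer."""
    _, nearest_label = min(
        examples, key=lambda example: math.dist(customer, example[0])
    )
    return nearest_label


# New customers without a label are assigned the label of their neighbour.
print(classify((1, 3.0), training_set))
print(classify((8, 85.0), training_set))
```

Here the "model" is simply the stored training set. Other classifiers compress the training set into rules or weights, but the interface is the same: learn from labelled objects, then label new ones.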

• Prediction: Prediction has attracted considerable attention given the potential implications of successful forecasting in a business context. There are two major types of predictions: one can either try to predict some unavailable data values or pending trends, or predict a class label for some data. The latter is tied to classification. Once a classification model is built based on a training set, the class label of an object can be foreseen based on the attribute values of the object and the attribute values of the classes. Prediction, however, more often refers to the forecast of missing numerical values, or of increase/decrease trends in time-related data. The major idea is to use a large number of past values to consider probable future values.

• Clustering: Similar to classification, clustering is the organization of data in classes. However, unlike classification, in clustering, class labels are unknown and it is up to the clustering algorithm to discover acceptable classes. Clustering is also called unsupervised classification, because the classification is not dictated by given class labels. There are many clustering approaches, all based on the principle of maximizing the similarity between objects in the same class (intra-class similarity) and minimizing the similarity between objects of different classes (inter-class similarity).

• Outlier analysis: Outliers are data elements that cannot be grouped in a given class or cluster. Also known as exceptions or surprises, they are often very important to identify. While outliers can be considered noise and discarded in some applications, they can reveal important knowledge in other domains, and thus can be very significant and their analysis valuable.
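Clustering without class labels can be sketched with k-means on one-dimensional data. This is only one of the many clustering approaches mentioned above; the rental counts and the initial centers are hypothetical, and the initial centers are fixed rather than random so that the run is repeatable.

```python
def k_means_1d(values, centers, iterations=10):
    """Cluster numbers around the given initial centers (Lloyd's algorithm)."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            sum(cluster) / len(cluster) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
        if new_centers == centers:
            break  # assignments are stable, so the algorithm has converged
        centers = new_centers
    return centers, clusters


# Hypothetical yearly rental counts; no class labels are supplied.
rentals_per_year = [2, 3, 4, 30, 32, 35]
centers, clusters = k_means_1d(rentals_per_year, centers=[0, 10])
print(clusters)
```

The algorithm discovers two groups, occasional and frequent renters, on its own. A value far from every center, such as 500 rentals a year, would be a candidate outlier.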

• Evolution and deviation analysis: Evolution and deviation analysis pertain to the study of time-related data that changes in time. Evolution analysis models evolutionary trends in data, which allows characterizing, comparing, classifying or clustering of time-related data. Deviation analysis, on the other hand, considers differences between measured values and expected values, and attempts to find the cause of the deviations from the anticipated values.

It is common that users do not have a clear idea of the kind of patterns they can discover or need to discover from the data at hand. It is therefore important to have a versatile and inclusive data mining system that allows the discovery of different kinds of knowledge and at different levels of abstraction. This also makes interactivity an important attribute of a data mining system.

Is all that is Discovered Interesting and Useful?

Data mining allows the discovery of knowledge potentially useful and unknown. Whether the knowledge discovered is new, useful or interesting is very subjective and depends upon the application and the user. It is certain that data mining can generate, or discover, a very large number of patterns or rules. In some cases the number of rules can reach the millions. One can even think of a meta-mining phase to mine the oversized data mining results. To reduce the number of patterns or rules discovered that have a high probability to be non-interesting, one has to put a measurement on the patterns. However, this raises the problem of completeness. The user would want to discover all rules or patterns, but only those that are interesting.

The measurement of how interesting a discovery is, often called interestingness, can be based on quantifiable objective elements such as validity of the patterns when tested on new data with some degree of certainty, or on some subjective depictions such as understandability of the patterns, novelty of the patterns, or usefulness. Discovered patterns can also be found interesting if they confirm or validate a hypothesis sought to be confirmed, or unexpectedly contradict a common belief. This brings the issue of describing what is interesting to discover, with approaches such as meta-rule guided discovery that describes forms of rules before the discovery process, and interestingness refinement languages that interactively query the results for interesting patterns after the discovery phase.

Typically, measurements for interestingness are based on thresholds set by the user. These thresholds define the completeness of the patterns discovered. Identifying and measuring the interestingness of patterns and rules discovered, or to be discovered, is essential for the evaluation of the mined knowledge and the KDD process as a whole. While some concrete measurements exist, assessing the interestingness of discovered knowledge is still an important research issue.
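The role of user-set thresholds can be sketched as a simple filter over mined rules. The rules and their support and confidence values are hypothetical, and the function name interesting is introduced here for illustration only.

```python
# Hypothetical mined rules as (rule, support, confidence).
rules = [
    ("game -> pop", 0.02, 0.55),
    ("movie -> popcorn", 0.10, 0.30),
    ("game -> popcorn", 0.001, 0.90),
    ("movie -> pop", 0.08, 0.70),
]


def interesting(candidates, min_support, min_confidence):
    """Keep only the rules that meet both user-set thresholds."""
    return [
        rule
        for rule, sup, conf in candidates
        if sup >= min_support and conf >= min_confidence
    ]


kept = interesting(rules, min_support=0.01, min_confidence=0.5)
print(kept)
```

The rule "game -> popcorn" is dropped despite its high confidence because its support is below the threshold. This is the completeness trade-off: stricter thresholds give fewer rules to inspect but may hide a pattern the user would have cared about.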

How do we Categorize Data Mining Systems?

There are many data mining systems available or being developed. Some are specialized systems dedicated to a given data source or are confined to limited data mining functionalities; others are more versatile and comprehensive. Data mining systems can be categorized according to various criteria. Among other classifications are the following:

• Classification according to the type of data source mined: This classification categorizes data mining systems according to the type of data handled, such as spatial data, multimedia data, time-series data, text data, World Wide Web, etc.

• Classification according to the data model drawn on: This classification categorizes data mining systems based on the data model involved, such as relational database, object-oriented database, data warehouse, transactional, etc.

• Classification according to the kind of knowledge discovered: This classification categorizes data mining systems based on the kind of knowledge discovered or data mining functionalities, such as characterization, discrimination, association, classification, clustering, etc. Some systems tend to be comprehensive systems offering several data mining functionalities together.

• Classification according to mining techniques used: Data mining systems employ and provide different techniques. This classification categorizes data mining systems according to the data analysis approach used, such as machine learning, neural networks, genetic algorithms, statistics, visualization, database-oriented or data warehouse-oriented, etc. The classification can also take into account the degree of user interaction involved in the data mining process, such as query-driven systems, interactive exploratory systems, or autonomous systems. A comprehensive system would provide a wide variety of data mining techniques to fit different situations and options, and offer different degrees of user interaction.

What are the Issues in Data Mining?

Data mining algorithms embody techniques that have sometimes existed for many years, but have only lately been applied as reliable and scalable tools that time and again outperform older classical statistical methods. While data mining is still in its infancy, it is becoming a trend and ubiquitous. Before data mining develops into a conventional, mature and trusted discipline, many still pending issues have to be addressed. Some of these issues are addressed below. Note that these issues are not exclusive and are not ordered in any way.

• Security and Social Issues: Security is an important issue with any data collection that is shared and/or is intended to be used for strategic decision-making. In addition, when data is collected for customer profiling, user behavior understanding, correlating personal data with other information, etc., large amounts of sensitive and private information about individuals or companies are gathered and stored. This becomes controversial given the confidential nature of some of this data and the potential illegal access to the information. Moreover, data mining could disclose new implicit knowledge about individuals or groups that could be against privacy policies, especially if there is potential dissemination of discovered information. Another issue that arises from this concern is the appropriate use of data mining. Due to the value of data, databases of all sorts of content are regularly sold, and because of the competitive advantage that can be attained from implicit knowledge discovered, some important information could be withheld, while other information could be widely distributed and used without control.

• User Interface Issues: The knowledge discovered by data mining tools is useful as long as it is interesting, and above all understandable by the user. Good data visualization eases the interpretation of data mining results, as well as helps users better understand their needs. Many data exploratory analysis tasks are significantly facilitated by the ability to see data in an appropriate visual presentation. There are many visualization ideas and proposals for effective data graphical presentation. However, there is still much research to accomplish in order to obtain good visualization tools for large datasets that could be used to display and manipulate mined knowledge. The major issues related to user interfaces and visualization are “screen real-estate”, information rendering, and interaction. Interactivity with the data and data mining results is crucial since it provides means for the user to focus and refine the mining tasks, as well as to picture the discovered knowledge from different angles and at different conceptual levels.

• Mining Methodology Issues: These issues pertain to the data mining approaches applied and their limitations. Topics such as versatility of the mining approaches, the diversity of data available, the dimensionality of the domain, the broad analysis needs (when known), the assessment of the knowledge discovered, the exploitation of background knowledge and metadata, the control and handling of noise in data, etc. are all examples that can dictate mining methodology choices. For instance, it is often desirable to have different data mining methods available since different approaches may perform differently depending upon the data at hand. Moreover, different approaches may suit and solve users’ needs differently.

Most algorithms assume the data to be noise-free. This is of course a strong assumption. Most datasets contain exceptions, invalid or incomplete information, etc., which may complicate, if not obscure, the analysis process and in many cases compromise the accuracy of the results. As a consequence, data preprocessing (data cleaning and transformation) becomes vital. It is often seen as lost time, but data cleaning, as time-consuming and frustrating as it may be, is one of the most important phases in the knowledge discovery process. Data mining techniques should be able to handle noise in data or incomplete information.

More than the size of data, the size of the search space is even more decisive for data mining techniques. The size of the search space often depends upon the number of dimensions in the domain space. The search space usually grows exponentially when the number of dimensions increases. This is known as the curse of dimensionality. This “curse” affects the performance of some data mining approaches so badly that it is becoming one of the most urgent issues to solve.
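The exponential growth of the search space can be made concrete with a small calculation. As a simplifying assumption for illustration, each dimension is split into the same number of bins, so the number of cells to examine is the number of bins raised to the number of dimensions.

```python
def grid_cells(bins_per_dimension, dimensions):
    """Number of cells when every dimension is split into the same number of bins."""
    return bins_per_dimension ** dimensions


# With 10 bins per attribute, each extra attribute multiplies the space by 10.
for d in (1, 2, 5, 10, 20):
    print(f"{d:>2} dimensions -> {grid_cells(10, d):,} cells")
```

A dataset of a fixed size therefore covers a vanishing fraction of the cells as attributes are added, which is why approaches that work in a few dimensions can break down in many.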

• Performance Issues: Many artificial intelligence and statistical methods exist for data analysis and interpretation. However, these methods were often not designed for the very large data sets data mining is dealing with today. Terabyte sizes are common. This raises the issues of scalability and efficiency of the data mining methods when processing considerably large data. Algorithms with exponential and even medium-order polynomial complexity cannot be of practical use for data mining. Linear algorithms are usually the norm. In the same vein, sampling can be used for mining instead of the whole dataset. However, concerns such as completeness and choice of samples may arise. Other topics in the issue of performance are incremental updating and parallel programming. There is no doubt that parallelism can help solve the size problem if the dataset can be subdivided and the results can be merged later. Incremental updating is important for merging results from parallel mining, or updating data mining results when new data becomes available without having to re-analyze the complete dataset.
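Incremental updating and merging of partial results can be sketched with a running mean, a deliberately simple stand-in for a mining result. The class name RunningStats and the sample values are hypothetical; the point is that neither operation needs to re-read data that has already been processed.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Count and sum of the values seen so far; enough to recover the mean."""

    count: int = 0
    total: float = 0.0

    def update(self, value):
        """Incremental update: absorb one new observation."""
        self.count += 1
        self.total += value

    def merge(self, other):
        """Combine two partial results, e.g. from parallel workers."""
        return RunningStats(self.count + other.count, self.total + other.total)

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0


# Two partitions of the dataset are summarized independently.
part_a, part_b = RunningStats(), RunningStats()
for value in [10, 20, 30]:
    part_a.update(value)
for value in [40, 50]:
    part_b.update(value)

combined = part_a.merge(part_b)  # merge results from parallel mining
combined.update(60)              # new data arrives later
print(combined.mean)
```

Not every mining result can be merged this cheaply, which is part of why incremental and parallel mining are listed as open performance issues.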

• Data Source Issues: There are many issues related to the data sources. Some are practical, such as the diversity of data types, while others are philosophical, like the data glut problem. We certainly have an excess of data, since organizations already have more data than they can handle and are still collecting data at an even higher rate. If the spread of database management systems has helped increase the gathering of information, the advent of data mining is certainly encouraging more data harvesting. The current practice is to collect as much data as possible now and process it, or try to process it, later. The concern is whether organizations are collecting the right data at the appropriate amount, whether they know what they want to do with it, and whether they distinguish between what data is important and what data is insignificant.

Regarding the practical issues related to data sources, there is the subject of heterogeneous databases and the focus on diverse complex data types. Organizations are storing different types of data in a variety of repositories. It is difficult to expect a data mining system to effectively and efficiently achieve good mining results on all kinds of data and sources. Different kinds of data and sources may require distinct algorithms and methodologies. Currently, there is a focus on relational databases and data warehouses, but other approaches need to be pioneered for other specific complex data types. A versatile data mining tool, for all sorts of data, may not be realistic. Moreover, the proliferation of heterogeneous data sources, at structural and semantic levels, poses important challenges not only to the database community but also to the data mining community.

Better Data Mining Models

Business analysts know that only one quality measure really matters, whatever lift adepts and AUC adepts may tell them: what is in it for the business? Is it money? This implies that the business analyst can measure the quality of the models.

Following are some points to remember for effective data mining models:

• There is no data like more data

Observations: Push the data mining tool to its maximum limits. The more data in use, the better the model created. The best models are “ensembles” of weak learners, as in bagging. Instead of feeding one data file to the algorithm and letting it do the sampling, learning and averaging, the business analyst may prefer to make the samples and feed them one at a time to the algorithm. That way it is possible to use a lot more data before the tool crashes. Each individual model is pushed to its maximum limits, and the averaging can be done afterward. A second advantage of making the samples oneself is that the analyst can choose to generate samples that overlap as little as possible. That way the total number of different observations used in model building reaches much higher levels than by feeding only one file to the modeling tool.
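The idea of building one model per hand-made sample and averaging afterward can be sketched as follows. This is an illustration under stated assumptions, not the author's procedure: the learner is an ordinary least-squares line standing in for any modeling tool, the samples are disjoint (classical bagging would instead draw bootstrap samples with replacement), and the observations are synthetic.

```python
def non_overlapping_samples(rows, n_samples):
    """Split rows into n_samples disjoint samples by striding."""
    return [rows[i::n_samples] for i in range(n_samples)]


def fit_line(sample):
    """Least-squares fit of y = a * x + b; stands in for any learner."""
    n = len(sample)
    mean_x = sum(x for x, _ in sample) / n
    mean_y = sum(y for _, y in sample) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in sample)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in sample)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def ensemble_predict(models, x):
    """Average the predictions of the individual models."""
    return sum(a * x + b for a, b in models) / len(models)


# Synthetic observations lying exactly on y = 2x + 1.
observations = [(x, 2 * x + 1) for x in range(12)]
samples = non_overlapping_samples(observations, 3)
models = [fit_line(sample) for sample in samples]
prediction = ensemble_predict(models, 20)
print(prediction)
```

Because each model sees a different sample, the ensemble as a whole uses all twelve observations even though no single fit handles more than four at a time.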

Variables: Calculate additional (derived) fields. This is fairly easy. The business analyst can multiply, divide, add or subtract numbers, but the result should have some business meaning. Also find additional information, inside or outside the company.

• Find the best algorithm

It is tempting to state that for each problem there is probably one best algorithm, so that all the data miner has to do is try a handful of really different algorithms to find out which one is best for the problem. However, different data miners will use the same algorithm differently, according to their taste, experience, mood and preference. So find out which algorithm works best for the data miner and the business problem at hand.

• Zoom in on the business targets

When data miners want to use a data mining model to select the customers who are most likely to buy the business’s outstanding product XYZ, it is reasonable to use the past buyers of XYZ as the positive targets in the model. The data miner gets a model with an excellent lift and uses it for a mailing. When the mailing campaign is over, the data miner has all the data needed to create a new, better model for product XYZ. The targets are now the customers who bought XYZ in response to the mailing. With this new model, the data miner will take into account not only the customers’ “natural” propensity to buy, but also their willingness to respond to the mailing.

If the databases contain far more observations than the data mining tool likes, the only thing the data miner can do is use samples. Calculate the model, and the data miner can use it. But the data miner can push it a bit further. Use the model to score the entire customer base. Then zoom in on the customers with the best scores, say the top 10%. Use them to calculate a second model, which will use the far smaller differences in customer information to find the really promising ones.

• Make it simple

Nevertheless, data miners have to keep the data mining work as simple as possible, because the business that pays the bills wants them to deliver good models on time for its campaigns.

• Automate as much as possible

The data miner should not try out every possible algorithm in each data mining project. If problem A was best solved with algorithm X, then problem B, which is very similar to A, should probably be tackled with algorithm X as well. There is no need to waste time checking out other algorithms.

Introduction to Object-Oriented Database

In the modern computing world, the amount of data generated and stored in databases of organizations is vast and continuing to grow at a rapid pace. The data stored in these databases possess valuable hidden knowledge. The discovery of such knowledge can be very fruitful for taking effective decisions. Thus the need for developing methods for extracting knowledge from data is quite evident. Data mining, a promising approach to knowledge discovery, is the use of pattern recognition technologies with statistical and mathematical techniques for discovering meaningful new correlations, patterns and trends by analyzing large amounts of data stored in repositories. Data mining has made its impact on many applications such as marketing, customer relationship management, engineering, medicine, crime analysis, expert prediction, Web mining, and mobile computing, among others.

In general, data mining tasks can be classified into two categories: descriptive mining and predictive mining. “Descriptive Mining” is the process of extracting vital characteristics of data from databases. Some descriptive mining techniques are Clustering, Association Rule Mining and Sequential Mining. “Predictive Mining” is the process of deriving hidden patterns and trends from data to make predictions. The predictive mining techniques consist of a series of tasks, namely Classification, Regression and Deviation Detection. One of the important tasks of data mining is Data Classification, which is the process of finding a set of models that describe and distinguish data classes or concepts, in order to predict the class of objects whose class label is unknown.

For example, in a transportation network, all highways with the same structural and behavioral properties can be classified as a class highway. From the application point of view, classification helps in credit approval, product marketing, and medical diagnosis. Many techniques such as decision trees, neural networks, nearest neighbor methods and rough set-based methods enable the creation of classification models. Regardless of the potential effectiveness of data mining to appreciably enhance data analysis, this technology will remain a niche technology unless an effort is made to integrate data mining technology with traditional database systems. Database systems offer a uniform framework for data mining by proficiently administering large datasets, integrating different data types and storing the discovered knowledge.

For over a decade, relational databases (RDB) have been the accepted solution for efficient storage and retrieval of huge volumes of data. Relational databases are based on tables, which are static components of organizational information. In addition, relational databases can handle only simple predefined data types and face problems when dealing with complex data types, user-defined data types and multimedia. Thus, relational database technology fails to handle the needs of complex information systems. Often the semantics of relational databases are left unexplored within many relationships, which cannot be extracted without users’ help. Object-Oriented Databases (OODB) solve many of these problems. Based on the concepts of abstraction and generalization, object-oriented models capture the semantics and complexity of the data. Therefore, many research organizations are employing Object-Oriented Databases (OODB) to solve their problems of data storage, retrieval and processing.

An Object-Oriented Database (OODB) is a database in which the concepts of object-oriented languages are utilized. The principal strength of an Object-Oriented Database (OODB) is its ability to handle applications involving complex and interrelated information. But in the current scenario, the existing Object-Oriented Database Management System (OODBMS) technologies are not efficient enough to compete in the market with their relational counterparts. Apart from that, there are numerous applications built on existing Relational Database Management Systems (RDBMS). It is difficult, if not impossible, to move off those relational databases. Hence, database administrators intend to incorporate object-oriented concepts into the existing Relational Database Management Systems (RDBMSs), thereby exploiting the features of both RDBMSs and Object-Oriented (OO) concepts.

Undoubtedly, one of the significant characteristics of object-oriented programming is inheritance. “Inheritance” is the concept by which the variables and methods defined in the parent class (super class) are automatically inherited by its child class (sub class). There are two ways to represent relationships between classes and objects in object-oriented programming: “is-a” and “has-a” relationships. In an “is-a” relationship, an object of a sub class can also be thought of as an object of its super class. For instance, a class named Car exhibits an “is-a” relationship with a base class named Vehicle, since a car is a vehicle. In a “has-a” relationship, which is also known as “Composition”, a class object holds one or more object references as data members.

For example, a bicycle has a steering wheel and, in the same way, a wheel has spokes. Inheritance can also be stated as generalization, because the “is-a” relationship represents a hierarchy between the classes of objects. For example, a "fruit" is a generalization of "apple", "orange", "mango" and many others. Similarly, one can consider fruit to be an abstraction of apple, orange, etc. Conversely, since apples are fruits (i.e., an apple is-a fruit), apples are bound to contain all the attributes common to a fruit. In generalization hierarchies, the data members and methods of the super class are inherited by the subclasses, and the objects of a subclass can use those common properties of the super class without redefinition. This concept of generalization is very powerful, because it reduces redundancy and maintains “Integrity”.
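The two class relationships described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names follow the Car/Vehicle and bicycle/wheel/spoke examples in the text, while the attribute names and the spoke count are hypothetical.

```python
class Vehicle:
    """Super class holding what every vehicle shares."""

    def __init__(self, name):
        self.name = name

    def describe(self):
        return f"{self.name} is a vehicle"


class Car(Vehicle):
    """'is-a': a Car is a Vehicle and inherits describe() without redefining it."""


class Spoke:
    """A simple part object used for composition."""


class Wheel:
    """'has-a': a Wheel holds references to its Spoke objects."""

    def __init__(self, spoke_count):
        self.spokes = [Spoke() for _ in range(spoke_count)]


class Bicycle:
    """'has-a': a Bicycle holds a Wheel reference as a data member."""

    def __init__(self):
        # The spoke count is an arbitrary illustrative value.
        self.steering_wheel = Wheel(spoke_count=4)


car = Car("Car")
bike = Bicycle()
print(car.describe())
print(len(bike.steering_wheel.spokes))
```

Car reuses the behavior of Vehicle through inheritance, whereas Bicycle only refers to a Wheel and is not itself a kind of wheel.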

“Polymorphism” is another important object-oriented programming concept. It is a general term which stands for ‘Many forms’. Polymorphism in brief can be defined as "One Interface, Many Implementations". It is the property of being able to assign a different meaning or usage to something in different contexts; in particular, it allows an entity such as a variable, a function or an object to take more than one form. Polymorphism is different from Method Overloading or Method Overriding. In the literature, polymorphism is classified into three different kinds, namely pure, static and dynamic.
• “Pure Polymorphism” refers to a function which can take parameters of several data types.
• “Static Polymorphism” can be stated as function and operator overloading.
• “Dynamic Polymorphism” is achieved by employing inheritance and virtual functions.
Dynamic binding or runtime binding allows one to substitute polymorphic objects for each other at run-time. Polymorphism has a number of advantages. Its chief advantage is that it simplifies the definition of clients, as it allows the client to substitute, at run-time, an instance of one class for an instance of another class that has the same “Polymorphic Interface”.
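The three kinds of polymorphism listed above can be approximated in Python as follows. This is a hedged sketch, not a definitive mapping: Python resolves every call at run-time, so the operator-overloading example only approximates what the text calls static polymorphism, and all class and function names here are hypothetical.

```python
def describe(value):
    """Pure polymorphism: one function body accepts several data types."""
    return f"{type(value).__name__}: {value}"


class Money:
    """Operator overloading: '+' is given a meaning for Money objects."""

    def __init__(self, amount):
        self.amount = amount

    def __add__(self, other):
        return Money(self.amount + other.amount)


class Shape:
    """Common interface; subclasses supply their own area()."""

    def area(self):
        raise NotImplementedError


class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side * self.side


class Rectangle(Shape):
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def area(self):
        return self.width * self.height


# Dynamic polymorphism: the area() that runs depends on the calling object.
shapes = [Square(2), Rectangle(2, 3)]
areas = [shape.area() for shape in shapes]
total = Money(2) + Money(3)
```

The loop over shapes is the client code the text refers to: it never needs to know which concrete class it is handling.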

It is becoming increasingly important to extend the domain of study from relational database systems to object-oriented database systems and to probe the knowledge discovery mechanisms in object-oriented databases, because object-oriented database systems have emerged as a popular and influential setting in advanced database applications. Since standards are still not defined for Object-Oriented Database Management Systems (OODBMSs) as they are for relational Database Management Systems (DBMSs), and since most organizations have their information systems based on relational DBMS technology, the incorporation of object-oriented programming concepts into the existing Relational Database Management Systems (RDBMSs) is an ideal choice for designing a database that best suits advanced database applications.

A novel approach for the design of an object-oriented database is presented in my study. The design of the Object-Oriented Database (OODB) is carried out with the intention of achieving efficient classification on the database. Chiefly, I have extended the existing relational databases by incorporating object-oriented programming concepts to attain an object-oriented database. In my proposed approach, I have utilized the object-oriented programming concepts of inheritance and polymorphism to achieve the above stated goals. The Object-Oriented Database (OODB) is structured mainly by employing the class hierarchies of inheritance. The class relationships, namely “is-a” and “has-a”, are used to represent a class hierarchy in the proposed Object-Oriented Database (OODB). Another object-oriented programming concept, polymorphism, is utilized to achieve better classification. Polymorphism enables the usage of simple SQL queries to classify the designed Object-Oriented Database (OODB). The designed Object-Oriented Database (OODB) demands less implementation overhead and saves considerable memory space compared to Relational Databases (RDBs) while exploiting their essential features. The experimental results portray the efficiency of the proposed approach.

Object-Oriented Database (OODB)

The chief advantage of an Object-Oriented Database (OODB) is its ability to represent real-world concepts as data models in an effective and presentable manner. An Object-Oriented Database (OODB) is optimized to support object-oriented applications and different types of structures, including trees, composite objects and complex data relationships. The Object-Oriented Database (OODB) system handles complex databases efficiently and allows users to define a database, with features for creating, altering and dropping tables and establishing constraints. From the user’s perspective, an Object-Oriented Database (OODB) is just a collection of objects and the inter-relationships among those objects. Objects that resemble one another in properties and behavior are organized into classes. Every class is a container of a set of common attributes and methods shared by similar objects.

The “Attributes” or “Instance Variables” describe the “Properties of a Class”.

The “Method” describes the “Behavior of the Objects associated with the Class”. A “Class/Subclass Hierarchy” is used to represent “Complex Objects, where the Attributes of an Object themselves contain Complex Objects”.


The most important object-oriented concepts employed in an Object-Oriented Database (OODB) model are the inheritance mechanism and composite object modeling. In order to cope with the increased complexity of the object-oriented model, one can divide class features as follows: simple attributes (attributes with scalar types); complex attributes (attributes with complex types); simple methods (methods accessing only the local simple attributes of the class); and complex methods (methods that return or refer to instances of other classes). The object-oriented approach uses two important abstraction principles for structuring designs: Classification and Generalization. “Classification” is defined as “an abstraction principle by which objects with similar properties are grouped into classes defining the structure and behavior of their instances.” “Generalization” is defined as “an abstraction principle by which all the common properties shared by several classes are organized into a single super class to form a Class Hierarchy.”

Since the appearance of the first Object-Oriented Database Management System (OODBMS), Gemstone, in the mid-eighties, a dozen other commercial Object-Oriented Database Management Systems (OODBMSs) have joined the fierce competition in the market. Regarding the applications of Object-Oriented Databases (OODBs), vendors have focused on Computer Aided Design (CAD), Computer Aided Manufacturing (CAM) and Computer Aided Software Engineering (CASE). All these applications are meant to handle complex information, and Object-Oriented Database (OODB) systems promise efficient solutions to these problems. Factory and office automation are other application areas of object-oriented database technology.


New Approach to the Design of Object Oriented Database

The general computer literature defines three approaches to building an Object-Oriented Database Management System (OODBMS): extending an Object-Oriented Programming Language (OOPL), extending a Relational Database Management System (RDBMS), and starting from scratch. The “First” approach develops an Object-Oriented Database Management System (OODBMS) by adding persistent storage to an Object-Oriented Programming Language (OOPL) in order to achieve multiple concurrent accesses with transaction support. The “Second” is an extended relational approach: an Object-Oriented Database Management System (OODBMS) is built by extending an existing Relational Database Management System (RDBMS) with object-oriented features such as classes and inheritance, methods and encapsulation, polymorphism and complex objects. The “Third” approach aims to revolutionize database technology in the sense that an Object-Oriented Database Management System (OODBMS) is designed from the ground up, as represented by UniSQL / UniOracle and OpenOODB (Open Object-Oriented Database). In my design, I have employed the second approach, which extends relational databases by utilizing Object-Oriented Programming (OOP) concepts.


The proposed approach makes use of the Object-Oriented Programming (OOP) concepts of “Inheritance” and “Polymorphism” to design an Object-Oriented Database (OODB) and to perform classification in it, respectively. Normally, a database is a collection of tables. Hence, any database I consider is bound to contain a number of tables with common fields. In my approach, I have grouped together such common sets of fields to form a single generalized table. The newly created table resembles the base class in the inheritance hierarchy. This ability to represent classes in a hierarchy is one of the eminent Object-Oriented Programming (OOP) concepts. Next, I have employed another important object-oriented characteristic, dynamic polymorphism, where different classes have methods of the same name and structure that perform different operations based on the “Calling Object”. Polymorphism is specifically employed to achieve classification in a simple and effective manner. The use of these object-oriented concepts for the design of the Object-Oriented Database (OODB) ensures that even complex queries can be answered more efficiently. In particular, the data mining task of classification can be achieved in an effective manner.

Let T denote the set of all tables in a database D, and let t be a subset of T, where ‘t’ represents the set of tables in which some fields are in common. I have created a generalized table composed of all those common fields from the table set ‘t’. To portray the efficiency of my proposed approach, I consider a traditional database. A traditional database for a large business organization will have a number of tables, but to best illustrate the Object-Oriented Programming (OOP) concepts employed in my approach, I have concentrated on three tables, namely Employees, Suppliers and Customers. The tables are represented as Table 1, Table 2 and Table 3 respectively.


Table 1: Example of Employees Table


Table 2: Example of Customers Table


Table 3: Example of Suppliers Table

The above set of tables, namely Employees, Suppliers and Customers, can be represented equivalently as classes. The class structure may look like that in Figure 9.

Figure 9: Class Structure of Employees, Suppliers and Customers Table

From the above class structure, it is understood that every table has a set of general or common fields (the highlighted ones) and table-specific fields. Considering the Employees table, it has general fields like Name, Age, Gender etc. and table-specific fields like Title, HireDate etc. These general fields occur repeatedly in most tables. This causes redundancy and thereby increases space complexity. Moreover, if a query is given to retrieve a set of records for the whole organization satisfying a particular rule, there may be a need to search all the tables separately. So, this replication of general fields in the tables leads to a poor design which affects effective data classification. To perform better classification, I have designed an Object-Oriented Database (OODB) by incorporating the inheritance concept of Object-Oriented Programming (OOP).
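The split between general and table-specific fields can be sketched as a set intersection over the table schemas. This is a minimal illustration, assuming hypothetical schemas: only Name, Age, Gender, Title and HireDate come from the text, and the remaining field names are invented for the example.

```python
# Hypothetical schemas; CompanyName and CreditLimit are illustrative only.
tables = {
    "Employees": ["Name", "Age", "Gender", "Title", "HireDate"],
    "Suppliers": ["Name", "Age", "Gender", "CompanyName"],
    "Customers": ["Name", "Age", "Gender", "CreditLimit"],
}


def generalize(schemas):
    """Return (common fields, table-specific fields per table)."""
    common = set.intersection(*(set(fields) for fields in schemas.values()))
    # Keep the column order of the first table for the generalized table.
    first = next(iter(schemas.values()))
    generalized = [field for field in first if field in common]
    specialized = {
        name: [field for field in fields if field not in common]
        for name, fields in schemas.items()
    }
    return generalized, specialized


person_fields, specific_fields = generalize(tables)
print(person_fields)
print(specific_fields)
```

The first return value plays the role of the generalized table, and the second lists what each inheriting table would still need to store itself.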

Design of the Object-Oriented Database

In my proposed approach, I have designed an Object-Oriented Database (OODB) by utilizing the inheritance concept of Object-Oriented Programming (OOP), which eliminates the problem of redundancy. First, I have located all the general or common fields from the table set ‘t’. Then, all these general or common fields are fetched and stored in a single table, and all the related tables can inherit it. Thus the generalized table resembles the base class of the Object-Oriented Programming (OOP) paradigm. In my approach, I have created a new table called ‘Person’, which contains all those common fields, and the other tables like Employees, Suppliers and Customers inherit the Person table without redefining it.

Here, I have used two important mechanisms, namely “Generalization” and “Composition”. Generalization depicts an “is-a” relation and composition represents a “has-a” relation. Both these relationships can be illustrated as follows. The generalized table “Person” contains all the common fields, and the tables “Employees, Suppliers and Customers” inheriting the table “Person” are said to have an “is-a” relationship with the table Person, i.e., an Employee is a Person, a Supplier is a Person and a Customer is a Person. Similarly, to exemplify the composition relation, the table Person contains an object reference of the “Places” table as its field. The table Person is then said to have a “has-a” relationship with the table Places, i.e., a Person has a Place and, similarly, a Place has a Postal Code. Figure 10 represents the inheritance class hierarchy of the proposed Object-Oriented Database (OODB) design. In the pictured design, the small triangle represents the “is-a” relationship and the arrow (→) represents the “has-a” relationship.


Figure 10: Inheritance Hierarchy of Classes in the Proposed OODB Design


The generalized table ‘Person’ is considered as the base class ‘Person’, and its fields are considered as the attributes of the base class ‘Person’. Therefore, the base class ‘Person’, which contains all the common attributes, is inherited by the other classes, namely Employees, Suppliers and Customers, which contain only the specialized attributes. Moreover, inheritance allows me to define the generalized methods in the base class and the specialized methods in the sub classes. For example, if there is a need to get the contact numbers of all the people associated with the organization, I can define a method getContactNumbers() in the base class ‘Person’ and it can be shared by its subclasses. In addition, the generalized class ‘Person’ exhibits a composition relationship with another two classes, ‘Places’ and ‘PostalCodes’. The class ‘Person’ uses instance variables which are object references of the classes ‘Places’ and ‘PostalCodes’. The tables in the proposed Object-Oriented Database (OODB) design are shown in Tables 4 to 9.
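The class hierarchy described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Oracle implementation used in the project: the attribute names, sample values and the snake_case method name get_contact_numbers (standing in for getContactNumbers()) are hypothetical.

```python
class PostalCode:
    def __init__(self, code):
        self.code = code


class Place:
    def __init__(self, city, postal_code):
        self.city = city
        self.postal_code = postal_code  # has-a: a Place has a PostalCode


class Person:
    """Generalized base class holding the common attributes."""

    def __init__(self, name, phones, place):
        self.name = name
        self.phones = phones
        self.place = place  # has-a: a Person has a Place

    def get_contact_numbers(self):
        # Defined once in the base class and shared by every subclass.
        return list(self.phones)


class Employee(Person):
    """is-a Person; adds only a specialized attribute."""

    def __init__(self, name, phones, place, title):
        super().__init__(name, phones, place)
        self.title = title


class Supplier(Person):
    """is-a Person; inherits everything unchanged in this sketch."""


class Customer(Person):
    """is-a Person; inherits everything unchanged in this sketch."""


mumbai = Place("Mumbai", PostalCode("400103"))
people = [
    Employee("Asha", ["555-0101"], mumbai, "Analyst"),
    Supplier("Ravi", ["555-0102"], mumbai),
    Customer("Meena", ["555-0103"], mumbai),
]
all_numbers = [n for person in people for n in person.get_contact_numbers()]
```

Because the method lives in Person, collecting the contact numbers for the whole organization needs one loop rather than one routine per table.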

Table 4: Example of Persons Table

Table 5: Example of Extended Employees Table

Table 6: Example of Extended Suppliers Table

Table 7: Example of Extended Customers Table

Table 8: Example of Extended Places Table

Table 9: Example of Extended PostalCodes Table

Owing to the incorporation of the inheritance concept in the proposed design, the database designer can extend the database by effortlessly adding new tables, merely by inheriting the common fields from the generalized table.

Data Mining in the Designed Object-Oriented Database

“Dynamic Polymorphism” or “Late Binding” allows the programmer to define methods with the same name in different classes, and the method to be called is decided at runtime based on the calling object. This Object-Oriented Programming (OOP) concept and simple SQL/ORACLE queries can be used to perform classification in the designed Object-Oriented Database (OODB). Here, a single method can do the classification process for all the tables. By integrating the polymorphism concept, the code is simpler to write and easier to manage. The database administrator can also access the method specifically for individual entities, namely Employees, Suppliers and Customers. As a result of the designed Object-Oriented Database (OODB), the task of classification can be carried out effectively by using simple SQL/ORACLE queries. Thus, by incorporating the Object-Oriented Programming (OOP) concepts for designing the Object-Oriented Database (OODB), I have exploited the advantages of Object-Oriented Programming (OOP), and the task of classification is also performed more effectively. The uniqueness of my concept is that the classification process can be performed by using a simple SQL/ORACLE query, while the existing classification approaches for Object-Oriented Databases (OODBs) employ complex techniques such as decision trees, neural networks, nearest neighbor methods and more.
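To make the idea of classifying with a simple query concrete, here is a small sketch. It is an assumption-laden illustration: it uses SQLite through Python's standard sqlite3 module rather than Oracle, and the column names, sample rows and the age-group rule are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Persons (PersonID INTEGER PRIMARY KEY, Name TEXT, Age INTEGER);
    CREATE TABLE Employees (PersonID INTEGER REFERENCES Persons(PersonID), Title TEXT);
    CREATE TABLE Customers (PersonID INTEGER REFERENCES Persons(PersonID), CompanyName TEXT);
    INSERT INTO Persons VALUES (1, 'Asha', 28), (2, 'Ravi', 45), (3, 'Meena', 61);
    INSERT INTO Employees VALUES (1, 'Analyst');
    INSERT INTO Customers VALUES (2, 'Acme'), (3, 'Globex');
""")

# One query over the generalized table classifies everyone in the
# organization; without it, each entity table would be queried separately.
rows = conn.execute("""
    SELECT Name,
           CASE WHEN Age < 30 THEN 'young'
                WHEN Age < 60 THEN 'middle'
                ELSE 'senior'
           END AS AgeGroup
    FROM Persons
    ORDER BY PersonID
""").fetchall()

for name, age_group in rows:
    print(name, age_group)
```

Because the common fields live in one table, the rule is written once, and a join with Employees or Customers would restrict the same query to a single entity.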

Implementation and Results

In this section, I have presented the experimental results of my approach. The proposed approach for the design of the Object-Oriented Database (OODB) and classification has been implemented with ORACLE as the database. I have considered only three tables for experimentation. But in general, an organization may have a number of tables to manage, and the number of records in each table is enormous. The incorporation of the Object-Oriented Programming (OOP) concepts into such databases greatly reduces the implementation overhead incurred. Moreover, the memory space saved grows as the size of the tables increases. These are some of the eminent benefits of the proposed approach.

I have performed a comparative analysis, through a review of Computer Reseller News (CRN) Magazines and the COMPUTER Monthly Newspaper, of the space utilized before and after generalization of the tables, and thus I have computed the saved memory space. The comparison is performed with varying numbers of records in the tables, namely 1000, 2000, 3000, 4000 and 5000, and the results are stated below in Table 10, Table 11, Table 12, Table 13 and Table 14 respectively.


                   Normalized                                         Un-normalized
No.  Table         Fields  Records  Fields x Records  Memory (bytes)  Fields  Fields x Records  Memory (bytes)
1    Customers          4     1000              4000           40000      15             15000          150000
2    Employees          5     1000              5000           50000      16             16000          160000
3    Suppliers          5     1000              5000           50000      16             16000          160000
4    Persons            8     3000             24000          240000
5    Places             3      500              1500           15000
6    PostalCodes        4      250              1000           10000
     Total                                     40500          405000                     47000          470000

Saved Memory (KB): 63.47656

Table 10: Saved Memory Table {Source: Computer Reseller News (CRN) Magazines}

                   Normalized                                         Un-normalized
No.  Table         Fields  Records  Fields x Records  Memory (bytes)  Fields  Fields x Records  Memory (bytes)
1    Customers          4     2000              8000           80000      15             30000          300000
2    Employees          5     2000             10000          100000      16             32000          320000
3    Suppliers          5     2000             10000          100000      16             32000          320000
4    Persons            8     6000             48000          480000
5    Places             3     1000              3000           30000
6    PostalCodes        4      500              2000           20000
     Total                                     81000          810000                     94000          940000

Saved Memory (KB): 126.9531

Table 11: Saved Memory Table {Source: Computer Reseller News (CRN) Magazines}

                   Normalized                                         Un-normalized
No.  Table         Fields  Records  Fields x Records  Memory (bytes)  Fields  Fields x Records  Memory (bytes)
1    Customers          4     3000             12000          120000      15             45000          450000
2    Employees          5     3000             15000          150000      16             48000          480000
3    Suppliers          5     3000             15000          150000      16             48000          480000
4    Persons            8     9000             72000          720000
5    Places             3     1500              4500           45000
6    PostalCodes        4      750              3000           30000
     Total                                    121500         1215000                    141000         1410000

Saved Memory (KB): 190.4297

Table 12: Saved Memory Table {Source: Computer Reseller News (CRN) Magazines}

                   Normalized                                         Un-normalized
No.  Table         Fields  Records  Fields x Records  Memory (bytes)  Fields  Fields x Records  Memory (bytes)
1    Customers          4     4000             16000          160000      15             60000          600000
2    Employees          5     4000             20000          200000      16             64000          640000
3    Suppliers          5     4000             20000          200000      16             64000          640000
4    Persons            8    12000             96000          960000
5    Places             3     2000              6000           60000
6    PostalCodes        4     1000              4000           40000
     Total                                    162000         1620000                    188000         1880000

Saved Memory (KB): 253.9063

Table 13: Saved Memory Table {Source: Computer Reseller News (CRN) Magazines}

                   Normalized                                         Un-normalized
No.  Table         Fields  Records  Fields x Records  Memory (bytes)  Fields  Fields x Records  Memory (bytes)
1    Customers          4     5000             20000          200000      15             75000          750000
2    Employees          5     5000             25000          250000      16             80000          800000
3    Suppliers          5     5000             25000          250000      16             80000          800000
4    Persons            8    15000            120000         1200000
5    Places             3     2500              7500           75000
6    PostalCodes        4     1250              5000           50000
     Total                                    202500         2025000                    235000         2350000

Saved Memory (KB): 317.3828

Table 14: Saved Memory Table {Source: Computer Reseller News (CRN) Magazines}
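The saved-memory figures in Tables 10 to 14 can be reproduced with a short calculation. This sketch rests on assumptions read off the tables rather than stated in the text: each field value occupies 10 units of memory (taken to be bytes, since dividing by 1024 yields the reported KB figures), Persons holds three times the per-table record count, and Places and PostalCodes hold one half and one quarter of it.

```python
BYTES_PER_VALUE = 10  # assumption inferred from the tables

NORMALIZED_FIELDS = {
    "Customers": 4, "Employees": 5, "Suppliers": 5,
    "Persons": 8, "Places": 3, "PostalCodes": 4,
}
UNNORMALIZED_FIELDS = {"Customers": 15, "Employees": 16, "Suppliers": 16}


def normalized_records(n):
    """Record counts per table when each entity table holds n records."""
    return {
        "Customers": n, "Employees": n, "Suppliers": n,
        "Persons": 3 * n, "Places": n // 2, "PostalCodes": n // 4,
    }


def saved_memory_kb(n):
    """Memory saved by the generalized design, in KB, for n records per table."""
    records = normalized_records(n)
    normalized = BYTES_PER_VALUE * sum(
        NORMALIZED_FIELDS[table] * records[table] for table in NORMALIZED_FIELDS
    )
    unnormalized = BYTES_PER_VALUE * sum(
        fields * n for fields in UNNORMALIZED_FIELDS.values()
    )
    return (unnormalized - normalized) / 1024


for n in (1000, 2000, 3000, 4000, 5000):
    print(n, round(saved_memory_kb(n), 4))
```

Under these assumptions the saving grows linearly with the number of records, which matches the trend the tables report.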


The results of the comparative analysis are illustrated graphically in Figure 11. From the graph, it is clear that the saved memory space increases as the number of records in each table increases.

Figure 11: Graph Demonstrating the above Evaluation Results

Moreover, in the proposed approach, I have placed the common methods in the generalized class and the entity-specific methods in the subclasses. Because of this design, I have saved considerable memory space.

For instance, in a traditional database, if a method getContactNumbers() is defined to get the contact numbers, the method has to be defined in all the classes and all those results have to be combined to obtain the final result. But in the proposed approach, I have generalized all the classes, so the redefinition of methods for all the related classes is not needed. If there are ‘n’ classes, placing the common methods in the base class saves the memory space given by Equation 1.

Equation 1: To Determine the Memory Size

Building Profitable Customer Relationships with Data Mining

Once an organization has built its customer information and marketing data warehouse, how can it make good use of the data it contains? “Customer Relationship Management” (CRM) helps companies improve the profitability of their interactions with customers while at the same time making the interactions appear friendlier through individualization. To succeed with CRM, companies need to match products and campaigns to prospects and customers; in other words, they need to intelligently manage the "Customer Life Cycle”. Until recently, most CRM software has focused on simplifying the organization and management of customer information. Such software, called “Operational CRM”, has focused on creating a customer database that presents a consistent picture of the customer’s relationship with the company, and on providing that information in specific applications, such as sales force automation and customer service, in which the company “touches” the customer.

However, the sheer volume of customer information and increasingly complex interactions with customers have propelled data mining to the forefront of making the organization's customer relationships profitable. Data mining is a process that uses a variety of data analysis and modeling techniques to discover patterns and relationships in data that may be used to make accurate predictions. It can help the data miner to select the right prospects on whom to focus, offer the right additional products to the organization's existing customers, and identify good customers who may be about to leave. The result is improved revenue, because of a greatly improved ability to respond to each individual contact in the best way, and reduced costs, due to properly allocating the business resources. CRM applications that use data mining are called Analytic CRM.

This section of the project will describe the various aspects of analytic CRM and show how it is used to manage the customer life cycle more cost-effectively. The case histories of these fictional companies are composites of real-life data mining applications.

Data Mining in Customer Relationship Management

The first and simplest analytical step in data mining is to "Describe the Data": for example, summarize its statistical attributes (such as means and standard deviations), visually review it using charts and graphs, and look at the distribution of values of the fields in the organization's data. But data description alone cannot provide an action plan. An organization must "Build a Predictive Model" based on patterns determined from known results, and then test that model on results outside the original sample. A good model should never be confused with reality (a businessman knows that a road map isn’t a perfect representation of the actual road), but it can be a useful guide to understanding the business.

Data mining can be used for both classification and regression problems. In "Classification Problems" the business analyst is predicting what category something will fall into, for example, whether a person will be a good credit risk or not, or which of several offers someone is most likely to accept. In "Regression Problems" the business analyst is predicting a number, such as the probability that a person will respond to an offer.

In CRM, data mining is frequently used to assign a score to a particular customer or prospect indicating the likelihood that the individual will behave in the way the business wants. For example, a score could measure the propensity to respond to a particular offer or to switch to a competitor’s product. It is also frequently used to identify a set of characteristics (called a profile) that segments customers into groups with similar behaviors, such as buying a particular product. A special type of classification can recommend items based on similar interests held by groups of customers. This is sometimes called "Collaborative Filtering". The data mining technology used for solving Classification, Regression and Collaborative Filtering problems is briefly described in the Appendix at the end of the project.

Defining CRM

"Customer Relationship Management" in its broadest sense simply means managing all customer interactions. In practice, this requires using information about the business's customers and prospects to interact with them more effectively in all stages of the business's relationship with them. I refer to these stages as the customer life cycle. The customer life cycle has three stages:
• Acquiring customers
• Increasing the value of the customer
• Retaining good customers
Data mining can improve business profitability in each of these stages, through integration with operational CRM systems or as independent applications.

Applying Data Mining to CRM

In order to build good models for the business CRM system, there are a number of steps that must be followed. The Two Crows data mining process model described below is similar to other process models such as the CRISP-DM model, differing mostly in the emphasis it places on the different steps. Keep in mind that while the steps appear in a list, the data mining process is not linear: the CRM implementer will inevitably need to loop back to previous steps. For example, what the implementer learns in the “explore data” step may require adding new data to the data mining database. The initial models the implementer builds may provide insights that lead to the creation of new variables. The basic steps of data mining for effective CRM are:
• Define business problem
• Build marketing database
• Explore data
• Prepare data for modeling
• Build model
• Evaluate model
• Deploy model and results

• Define the business problem. Each CRM application will have one or more business objectives for which the business analyst will need to build the appropriate model. Depending on the specific business goal, such as “increasing the response rate” or “increasing the value of a response,” the business analyst will build a very different model. An effective statement of the problem will include a way of measuring the results of the CRM project.

• Build a Marketing Database. Steps two through four constitute the core of the data preparation. Together, they take more time and effort than all the other steps combined. There may be repeated iterations of the data preparation and model building steps as the business analyst learns something from the model that suggests modifying the data. These data preparation steps may take anywhere from 50% to 90% of the time and effort of the entire data mining process!

The business analyst will need to build a marketing database because the operational databases and corporate data warehouse will often not contain the data in the format needed. Furthermore, CRM applications may interfere with the speedy and effective execution of these systems. The data the business analyst needs may reside in multiple databases, such as the customer database, product database and transaction databases. This means the business analyst will need to integrate and consolidate the data into a single marketing database and reconcile differences in data values from the various sources. Improperly reconciled data is a major source of quality problems. There are often large differences in the way data is defined and used in different databases. Some inconsistencies may be easy to uncover, such as different addresses for the same customer. What makes these problems more difficult to resolve is that they are often subtle. For example, the same customer may have different names or, worse, multiple customer identification numbers. Once the marketing database is built, it will need to be cleaned up: good models require clean data.

• Explore the data. Before the business analyst can build good predictive models, the analyst must understand the business data. Start by gathering a variety of numerical summaries (including descriptive statistics such as averages, standard deviations and so forth) and looking at the distribution of the data. The analyst may want to produce cross tabulations (pivot tables) for multi-dimensional data. Graphing and visualization tools are a vital aid in data preparation, and their importance to effective data analysis cannot be overemphasized. Data visualization most often leads to new insights and success. Some of the common and very useful graphical displays of data are histograms or box plots that display distributions of values. The business analyst may also want to look at scatter plots in two or three dimensions of different pairs of variables. The ability to add a third, overlay variable greatly increases the usefulness of some types of graphs.

• Prepare data for modeling. This is the final data preparation step before building models and the step where the most “art” comes in. There are four main parts to this step. First, the business analyst selects the variables on which to build the model. Ideally, the analyst would take all the available variables, feed them to the data mining tool and let it find those which are the best predictors. In practice, this doesn’t work very well. One reason is that the time it takes to build a model increases with the number of variables. Another reason is that blindly including extraneous columns can lead to models with less rather than more predictive power.

The next step is to construct new predictors derived from the raw data. For example, forecasting credit risk using a debt-to-income ratio rather than just debt and income as predictor variables may yield more accurate results that are also easier to understand. Next, the business analyst may decide to select a subset or sample of the data on which to build models. If there is a lot of data, using all of it may take too long or require buying a bigger computer than the analyst would like. Working with a properly selected random sample usually results in no loss of information for most CRM problems. Given a choice of either investigating a few models built on all the data or investigating more models built on a sample, the latter approach will usually help the business analyst to develop a more accurate and robust model. Last, the business analyst will need to transform variables in accordance with the requirements of the algorithm chosen for building the model.
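Two of the preparation steps above, constructing a derived predictor and drawing a random sample, can be sketched with the Python standard library. The customer records, field names and sample size below are hypothetical and serve only to illustrate the steps.

```python
import random

# Hypothetical raw customer records with debt and income columns.
customers = [
    {"id": i, "debt": 1000.0 * (i % 5 + 1), "income": 20000.0 + 1000.0 * i}
    for i in range(1, 101)
]


def add_debt_to_income(record):
    """Return a copy of the record with a derived debt-to-income predictor."""
    enriched = dict(record)
    enriched["debt_to_income"] = record["debt"] / record["income"]
    return enriched


def draw_sample(records, size, seed=0):
    """Draw a simple random sample without replacement (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(records, size)


prepared = [add_debt_to_income(c) for c in customers]
sample = draw_sample(prepared, 20)
print(len(sample), sample[0]["debt_to_income"])
```

Fixing the seed keeps the sample stable between runs, which makes it easier to compare several candidate models built on the same subset.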

• Data mining model building. The most important thing to remember about model building is that it is an iterative process. The business analyst will need to explore alternative models to find the one that is most useful in solving the business problem. What the analyst learns in searching for a good model may lead the analyst to go back and make some changes to the data being used or even modify the problem statement. Most CRM applications are based on a protocol called supervised learning. The analyst starts with customer information for which the desired outcome is already known. For example, the marketer may have historical data from a previous mailing to a list very similar to the one now being used. Or the marketer may have to conduct a test mailing to determine how people will respond to an offer. The marketer then splits this data into two groups. On the first group the marketer trains or estimates the model. The marketer then tests it on the remainder of the data. A model is built when the cycle of training and testing is completed.

• Evaluate the business results. Perhaps the most overrated metric for evaluating the business results is accuracy. Suppose marketers have an offer to which only 1% of the people will respond. A model that predicts “nobody will respond” is 99% accurate and 100% useless. Another measure that is frequently used is lift. Lift measures the improvement achieved by a predictive model. However, lift does not take into account cost and revenue, so it is often preferable to look at profit or Return on Investment (ROI). Depending on whether the marketer chooses to maximize lift, profit, or ROI, the marketer will choose a different percentage of the mailing list to which to send solicitations.
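The difference between accuracy and lift can be made concrete with a small sketch. The scored test set below is simulated, not real campaign data: it assumes 10,000 held-out prospects of whom 1% respond, and a hypothetical model whose scores are somewhat higher for responders. Lift is computed here as the response rate in the top-scored fraction divided by the overall response rate, which is one common definition.

```python
import random

rng = random.Random(0)  # fixed seed so the simulated data is reproducible

# Simulated held-out data: each entry is (model_score, responded).
test_set = []
for i in range(10_000):
    responded = i < 100  # exactly 1% responders
    score = rng.random() + (0.5 if responded else 0.0)
    test_set.append((score, responded))


def accuracy_of_nobody_model(rows):
    """Accuracy of a model that predicts 'no response' for every prospect."""
    return sum(1 for _, responded in rows if not responded) / len(rows)


def lift_at(rows, fraction):
    """Response rate in the top-scored fraction divided by the overall rate."""
    ranked = sorted(rows, key=lambda r: r[0], reverse=True)
    top = ranked[: max(1, int(len(ranked) * fraction))]
    top_rate = sum(r[1] for r in top) / len(top)
    base_rate = sum(r[1] for r in rows) / len(rows)
    return top_rate / base_rate


print("accuracy of 'nobody responds':", accuracy_of_nobody_model(test_set))
print("lift in top 10% of scores:", lift_at(test_set, 0.10))
```

The "nobody responds" model scores 99% on accuracy yet selects no one to mail, while the lift figure shows how much better than random a scored mailing to the top decile would be.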

• Incorporating Data Mining in the business CRM solution. In building a CRM application, data mining is often only a small, albeit critical, part of the final product. For example, predictive patterns found through data mining may be combined with the knowledge of domain experts and incorporated in a large application used by many different kinds of people. The way data mining is actually built into the application is determined by the nature of the customer interaction. There are two main ways a business interacts with its customers: the customer contacts the business (inbound) or the business contacts the customer (outbound). The deployment requirements are quite different. Outbound interactions are characterized by the company originating the contact, such as in a direct mail campaign. Thus marketers will be selecting the people to whom to mail by applying the model to the customer database. Another type of outbound campaign is an advertising campaign. In this case the marketers would match the profiles of good prospects shown by the model to the profile of the people the advertisement would reach.

For inbound transactions, such as a telephone order, an Internet order, or a customer service call, the application must respond in real time. Therefore the data mining model is embedded in the application and actively recommends an action. In either case, one of the key issues the business must deal with in applying a model to new data is the transformations used in building the model. Thus if the input data (whether from a transaction or a database) contains age, income, and gender fields, but the model requires the age-to-income ratio and gender has been changed into two binary variables, marketers must transform the input data accordingly. The ease with which the business analyst can embed these transformations becomes one of the most important productivity factors when marketers want to rapidly deploy many models.
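The transformation described above can be sketched as a single function that is reused for both database records and live transactions. The field names and the single-letter gender coding are assumptions made for this illustration.

```python
def transform(record):
    """Map raw age, income and gender fields to the inputs a model was trained on.

    Output names (age_to_income, gender_is_female, gender_is_male) are hypothetical.
    """
    gender = record["gender"].strip().upper()
    return {
        "age_to_income": record["age"] / record["income"],
        "gender_is_female": 1 if gender == "F" else 0,
        "gender_is_male": 1 if gender == "M" else 0,
    }


# The same function serves a batch of database rows...
database_rows = [
    {"age": 40, "income": 80000.0, "gender": "f"},
    {"age": 25, "income": 50000.0, "gender": "M"},
]
model_inputs = [transform(row) for row in database_rows]

# ...and a single inbound transaction handled in real time.
live_input = transform({"age": 33, "income": 66000.0, "gender": "F"})

print(model_inputs)
print(live_input)
```

Keeping the transformation in one place means the model sees identically prepared inputs whether it is applied in batch or embedded in an application.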

How to Data Mine for Future Trends

The exciting phenomenon of data mining has taken the business world by storm. The technologies now available on the open market enable all kinds of speculation and observation based on quantifiable data that may be housed in databases or other computerized resources. Data mining can be used to track what customers are doing now and possibly even what customers will do in the future.

Steps to follow

• Collect a solid table of data going back several years from the present. A future trend projection is only as good as the data it's based on, and more accurate future projections involve longer histories in a database. The more past data the data miner has, the more the business analyst can tell about the future.
• Set up algorithms to search through the existing business data looking for behaviors, sales, or other trends that are currently rising.
• Choose a long-term or short-term analysis. As stated, it's good to have a long data history, but that's not necessarily the only criterion for future trends. The business analyst may want to look only at the short term, and focus on trends that are currently spiking above the norm.
• Set up visual graphing that shows the business analyst how different behaviors have occurred over time. The graph will include a line for each product, behavior, or trend under consideration. With visual graphing, the analyst will be able to see the results of the data mining at a glance.

• Add other mitigating data as it is found. Write reports supplementary to the data mining graphs that "explain" a trend and evaluate the chances of its continuance. Having background data and reasonable mitigating factors at hand will help the business analyst make better decisions about the future.
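The steps above leave the choice of algorithm open. One simple possibility, shown here only as a sketch, is to fit a least-squares slope to each series for the long-term view and to compare the latest value with the earlier average for the short-term view. The product names, sales figures and the 1.25 spike factor are all hypothetical.

```python
from statistics import mean

# Hypothetical monthly unit sales per product, oldest value first.
history = {
    "widgets": [100, 104, 110, 115, 123, 130],
    "gadgets": [200, 198, 195, 190, 188, 185],
    "gizmos": [50, 50, 51, 49, 50, 80],
}


def slope(values):
    """Least-squares slope of the values against their position (units per period)."""
    n = len(values)
    x_bar = (n - 1) / 2
    y_bar = mean(values)
    num = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(values))
    den = sum((x - x_bar) ** 2 for x in range(n))
    return num / den


def is_spiking(values, factor=1.25):
    """Short-term check: is the latest value well above the average of earlier ones?"""
    return values[-1] > factor * mean(values[:-1])


# Long-term view: series whose fitted trend line points upward.
rising = sorted(name for name, v in history.items() if slope(v) > 0)
# Short-term view: series currently spiking above their own norm.
spiking = sorted(name for name, v in history.items() if is_spiking(v))

print("rising:", rising)
print("spiking:", spiking)
```

In this made-up data, widgets rise steadily without spiking, while gizmos are flat until a jump in the final month, which is the kind of case a supplementary report would need to explain.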

The 10 Secrets to Using Data Mining to Succeed at CRM

• Planning is the key to a successful data mining project. As with any worthwhile endeavor, planning is half the battle. It is critical that any organization considering a data mining project first define project objectives from a business perspective, and then convert this knowledge into a coherent data mining strategy and a well-defined project plan. Plan for data mining success by following these three steps:
➢ Start with the end in mind. Avoid the “ad hoc trap” of mining data without defined business objectives. Prior to modeling, define a project that supports the organization’s strategic objectives. For example, the business objective might be to attract additional customers who are similar to the most valuable current customers. Or it might be to keep the most profitable customers longer.
➢ Get buy-in from business stakeholders. Be sure to involve all those who have a stake in the project. Typically, Finance, Sales, and Marketing are concerned with devising cost-effective CRM strategies. But Database and Information Technology managers are also “interested parties” since their teams are often called upon to support the execution of those strategies.
➢ Define an executable data mining strategy. Plan how to achieve the business objective by capitalizing on business resources. Both technical and staff resources must be taken into account.

• Set specific goals for the data mining project. Before an organization begins a data mining project, clarify just how data mining can help the organization achieve its business goal. For instance, if reducing customer defection or “churn” is a strategic objective, what level of improvement does the organization want to see? Next, commit to a standard data mining process, such as CRISP-DM (Cross-Industry Standard Process for Data Mining). Then create a project plan for achieving business goals, including a clear definition of what will constitute “success.” Finally, complete a cost-benefit analysis, taking care to include the cost of any resources that will be required.

• Recruit a broad-based project team. One of the most common mistakes made by those new to data mining is to simply pass responsibility for a data mining initiative to a data miner. Because successful data mining requires a clear understanding of the business problem being addressed, and because in most organizations elements of that business understanding are dispersed among different disciplines or departments, it’s important to recruit a broad-based team for the project. For instance, to evaluate the factors involved in customer churn, the organization may need staff members from Customer Service, Market Research, or even Billing, as well as those with specialized knowledge of business data resources and data mining. Depending upon the business objective, the organization may want to have representatives from some or all of the following roles: executive sponsor, project leader, business expert, data miner, data expert, and IT sponsor. Some projects may require two or three people; other projects may require more.

• Line up the right data. To help ensure success, it is critical to understand what kinds of data are available and what condition that data is in. Begin with data that is readily accessible. It doesn’t need to be a large amount or organized in a data warehouse. Many useful data mining projects are performed on small or medium-sized datasets, some containing only a few hundred or a few thousand records. For example, an analyst may be able to determine, from a sample of customer records, which of the company’s products are typically purchased by customers fitting a certain demographic profile. This enables the organization to predict what other customers might purchase or what offers customers might find most appealing.

• Secure IT buy-in. IT is an important component of any successful data mining initiative. Keep in mind that the data mining tool the organization selects will play an important role in securing buy-in from the IT department. The data mining tool should integrate with the existing data infrastructure (relevant databases, data warehouses, and data marts) and should provide open access to data and the capability to enhance existing databases with scores and predictions generated by data mining.

• Select the right data mining solution. Successful, efficient data mining requires data mining solutions that are open and well integrated. Using an integrated solution enables business analysts to follow a train of thought efficiently, regardless of the type of data involved in the analysis. Organizations save time and improve the flow of analysis by selecting solutions that support every step of the process. An integrated solution is particularly important when incorporating additional types of data, such as text, Web, or survey data. That’s because each type of data is likely to originate in a different system and exist in a variety of formats. Integration is also important during the “decision optimization” phase of predictive analytics. Decision optimization determines which actions will drive optimal outcomes, and then delivers those recommended actions to the systems or people that can effectively implement them. To support decision optimization, the business will want a solution that links to operational systems, such as the call center or marketing automation software. Such a solution supports more widespread and rapid, even real-time, delivery of predictive insight.

• Consider mining other types of data to increase the return on the data mining investment. When the business analyst combines text, Web, or survey data with structured data used in building models, the analyst enriches the information available for prediction. Even if the analyst adds only one type of additional data, the business will see an improvement in the results generated. Incorporating multiple types of data will provide even greater improvements. To determine if the company might benefit from incorporating additional types of data, begin by asking the following questions:
➢ What kinds of business problems are we trying to solve?
➢ What kinds of data do we have that might address these problems?
The answers to these questions will help the business analyst to determine what kinds of data to include, and why. If the business analysts are trying to learn why long-time customers are leaving, for example, they may want to analyze text from call center notes combined with results of customer focus groups or customer satisfaction surveys.

• Expand the scope of data mining to achieve even greater results. One way that the business analyst can increase the Return on Investment (ROI) generated by data mining is by expanding the number of projects undertaken. Gain more from the investment in data mining either by addressing additional related business challenges or by applying data mining in different departments or geographic regions. If the company has already made progress on the top-priority challenges (increasing the conversion rate for cross-selling campaigns, for example), consider whether there are secondary challenges that the business analyst might now address, such as trimming the cost of customer acquisition programs. With the right data mining solution, one that helps automate routine tasks, the business analyst can do this without increasing staff.

• Consider all available deployment options. When mining data, organizations that efficiently deploy results consistently achieve a higher ROI. In early implementations of data mining, deployment consisted of providing analysts with models and managers with reports. Models and reports had to be interpreted by managers or staff before strategic or tactical plans could be developed. Later, many companies used batch scoring, often conducted at off-peak hours, to more efficiently incorporate updated predictions in their databases. It even became possible to automate the scheduling of updates and to embed scoring engines within existing applications. Today, using the latest data mining technologies, the business analyst can update even massive datasets containing billions of scores in just a few hours. Data miners can also update models in real time and deploy results to customer-contact staff as the organization interacts with customers. In addition, the business analyst can deploy models or scores in real time to systems that generate sales offers automatically or make product suggestions to Web site visitors, to name just two possibilities.
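The contrast between batch scoring and real-time deployment can be sketched as below. The scoring rule is a made-up stand-in for a trained model, and the field names, weights, threshold and action labels are all assumptions used for illustration only.

```python
def churn_score(customer):
    """Stand-in scoring function; a real one would come from a trained model."""
    score = (
        0.1
        + 0.2 * (customer["complaints"] > 2)
        + 0.3 * (customer["months_inactive"] >= 3)
    )
    return round(score, 2)


def batch_score(customers):
    """Batch deployment: refresh the stored score on every record, e.g. overnight."""
    for c in customers:
        c["churn_score"] = churn_score(c)
    return customers


def next_best_action(customer, threshold=0.4):
    """Real-time deployment: score at the moment of contact and recommend an action."""
    if churn_score(customer) >= threshold:
        return "retention_offer"
    return "standard_service"


customers = [
    {"id": 1, "complaints": 0, "months_inactive": 0},
    {"id": 2, "complaints": 3, "months_inactive": 0},
    {"id": 3, "complaints": 3, "months_inactive": 4},
]

batch_score(customers)
print([c["churn_score"] for c in customers])
print(next_best_action({"id": 4, "complaints": 1, "months_inactive": 5}))
```

Both paths call the same scoring function, so a model updated in one place changes the stored scores and the live recommendations together.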

• Increase collaboration and efficiency through model management. Look into data mining solutions that enable the business analyst to centralize the management of data mining models and support the automation of processes such as the updating of customer scores. These solutions foster greater collaboration and enterprise efficiency. Central model management also helps the organization avoid wasted or duplicated effort while ensuring that the most effective predictive models are applied to the business challenges. Model management also provides a way to document model creation, usage, and application.

The Suggestion for Analytics, Business Intelligence, and Performance Management

In the wake of the long-running massive industry consolidation in the Enterprise Software industry that reached its zenith with the acquisitions of Business Intelligence market leaders Hyperion, Cognos, and Business Objects in 2007, one could certainly have been forgiven for being less than optimistic about the prospects of innovation in the Analytics, Business Intelligence, and Performance Management markets. This is especially true given the dozens of innovative companies that each of these large best of breed vendors themselves had acquired before being acquired in turn. While the pace of innovation has slowed to a crawl as the large vendors are midway through digesting the former best of breed market leaders, thankfully for the health of the industry, nothing could be further from the truth in the market overall. This market has in fact shown itself to be very vibrant, with a resurgence of innovative offerings springing up in the wake of the fall of the largest best of breed vendors. So what are the trends and where do we see the industry evolving to? Few of these are mutually exclusive, but in order to provide some categorization to the discussion, this has been broken down as follows:

• We will witness the emergence of packaged strategy-driven execution applications. As we discussed in Driven to Perform: Risk-Aware Performance Management From Strategy Through Execution (Nenshad Bardoliwalla, Stephanie Buscemi, and Denise Broady, New York, NY, Evolved Technologist Press, 2009), the end state for next-generation business applications is not merely to align the transactional execution processes contained in applications like ERP, CRM, and SCM with the strategic analytics of performance and risk management of the organization, but for those strategic analytics to literally drive execution. We called this “Strategy-Driven Execution”, the complete fusion of goals, initiatives, plans, forecasts, risks, controls, performance monitoring, and optimization with transactional processes. Visionary applications such as those provided by Workday, with embedded real-time contextual reporting available directly in the application (not as a bolt-on), and Oracle’s entire Fusion suite, layering Essbase and OBIEE capabilities tightly into the applications’ logic, clearly portend the increasing fusion of analytic and transactional capability in the context of business processes, and this will only increase.


• The holy grail of the predictive, real-time enterprise will start to deliver on its promises. While classic analytic tools and applications have always done a good job of helping users understand what has happened and then analyze the root causes behind this performance, this information is often stale before it reaches its intended audience. The holy grail of analytic technologies has always been the promise of being able to predict future outcomes by sensing and responding, with minimal latency between event and decision point. This has become manifest in the resurgence of interest in event-driven architectures that leverage a technology known as Complex Event Processing and predictive analytics. The predictive capabilities appear to be on their way to breakout market acceptance, as shown by IBM’s significant investment in setting up their Business Analytics and Optimization practice with 4000 dedicated consultants, combined with the massive product portfolio of the Cognos and recently acquired SPSS assets. Similarly, Complex Event Processing capabilities, a staple of extremely data-intensive, algorithmically sophisticated industries such as financial services, have also become interesting to a number of other industries that cannot deal with the amount of real-time data being generated and need to be able to capture value and decide instantaneously. Combining these capabilities will lead to new classes of applications for business management that were unimaginable a decade ago.


• The industry will put reporting and slice-and-dice capabilities in their appropriate places and return to its decision-centric roots with a healthy dose of Web 2.0 style collaboration. It was clear to the pioneers of this industry, beginning as early as H.P. Luhn’s brilliant visionary piece A Business Intelligence System from 1958, that the goal of these technologies was to support business decision-making activities, and we can trace the roots of modern analytics, business intelligence, and performance management to the decision-support notion of decades earlier. But somewhere along the way, business intelligence became synonymous with reporting and slicing-and-dicing, which is a metaphor that suits analysts, but not the average end-user. This has contributed to the paltry BI adoption rates of approximately 25% bandied about in the industry, despite the fact that investment in BI and its priority for companies has never been higher than over the last five years. Making report production cheaper to the point of nearly being free, something BI is poised to do, is still unlikely to improve this situation much. Instead, we will see a resurgence in collaborative decision-centric business intelligence offerings that make decisions the central focus of the offerings. From an operational perspective, this is certainly in evidence with the proliferation of rules-based approaches that can automate thousands of operational decisions with little human intervention. However, for more tactical and strategic decisions, mashups will allow users to assemble all of the relevant data for making a decision, social capabilities will allow users to discuss this relevant data to generate “crowd sourced” wisdom, and explicit decisions, along with automated inferences, will be captured and correlated against outcomes.
This will allow decision-centric business intelligence to make recommendations within process contexts for what the appropriate next action should be, along with confidence intervals for the expected outcome, as well as being able to tell the user what the risks of her decisions are and how they will impact both the company’s and her own personal performance.

• Performance, risk, and compliance management will continue to become unified in a process-based framework and make the leap out of the CFO’s office. The disciplines of performance, risk, and compliance management have been considered separate for a long time, but the walls are breaking down, as we documented thoroughly in Driven to Perform. Performance management begins with the goals that the organization is trying to achieve, and as risk management has evolved from its siloed roots into Enterprise Risk Management, it has become clear that risks must be identified and assessed in light of this same goal context. Similarly, in the wake of Sarbanes-Oxley, as compliance has become an extremely thorny and expensive issue for companies of all sizes, modern approaches suggest that compliance is ineffective when cast as a process of signing off on thousands of individual item checklists, but rather should be based on an organization’s risks. All three of these disciplines need to become unified in a process-based framework that allows for effective organizational governance. And while financial performance, risk, and compliance management are clearly the areas of most significant investment for most companies, it is clear that these concerns are now finally becoming enterprise-level plays that are escaping the confines of the Office of the CFO. We will continue to witness significant investment in sales and marketing performance management, as vendors like Right90 continue to gain traction in improving the sales forecasting process and vendors like Varicent receive hefty $35 million venture rounds this year, no doubt thanks to experiencing over 100% year over year growth in the burgeoning Sales Performance Management category. My former Siebel colleague, Bruce Cleveland, now a partner at Interwest, makes the case for this market expansion of performance management into the front-office rather convincingly and has invested correspondingly.

• Cloud Business Intelligence tools will steal significant revenue from on-premise vendors but also fight for limited oxygen amongst themselves. From many accounts, this was the year that cloud-based BI offerings hit the mainstream due to their numerous advantages over on-premise offerings, and this certainly was in evidence with the significant uptick in investment and market visibility of cloud Business Intelligence vendors. Although much was made of the folding of LucidEra, one of the original pioneers in the space, and while other vendors like BlinkLogic folded as well, vendors like Birst, PivotLink, Good Data, Indicee and others continue to announce wins at a fair clip along with innovations at a fraction of the cost of their on-premise brethren. From a functionality perspective, these tools offer great usability, some collaboration features, strong visualization capabilities, and an ease-of-use not seen with their on-premise equivalents, whereby users are able to manage the system in a self-sufficient fashion devoid of the need for significant IT involvement. It has long been argued that basic reporting and analysis is now a commodity, so there is little reason for any customer to invest in on-premise capabilities at the price/performance ratio that the cloud vendors are offering. We should thus expect to see continued diminution of the on-premise vendors’ BI revenue streams as the cloud BI value proposition goes mainstream, although it wouldn’t be surprising to see acquisitions by the large vendors to stem the tide. However, with so many small players in the market offering largely similar capabilities, the cloud Business Intelligence tools vendors may wind up starving themselves for oxygen as they put price pressure on each other to gain new customers. Only vendors whose offerings were designed from the beginning for cloud-scale architecture, and thus whose marginal cost per additional user approaches zero, will succeed in such a commodity pricing environment, although alternatively these vendors can pursue going upstream and try to compete in the enterprise, where the risks and rewards of competition are much higher. On the other hand, packaged cloud Business Intelligence applications such as those offered by Host Analytics, Adaptive Planning, and new entrant Anaplan, while showing promising growth, have yet to mature to mainstream adoption, but are poised to do so in the coming years. As with all applications, addressing key integration and security concerns will remain crucial to driving adoption.

• The undeniable arrival of the era of big data will lead to further proliferation in data management alternatives. While analytic-centric OLAP databases such as Oracle Express, Hyperion Essbase, and Microsoft Analysis Services have been around for decades, they have never held the same dominant market share from an applications consumption perspective that the RDBMS vendors have enjoyed over the last few decades. No matter what the application type, the RDBMS seemed to be the answer. However, we have witnessed an explosion of exciting data management offerings in the last few years that have reinvigorated the information management sector of the industry. The largest web players such as Google (BigTable), Yahoo (Hadoop), Amazon (Dynamo), and Facebook (Cassandra) have built their own solutions to handle their own incredible data volumes, with the open source Hadoop ecosystem and commercial offerings like CloudEra leading the charge in broad awareness. Additionally, a whole new industry of DBMSs dedicated to analytic workloads has sprung up, with flagship vendors like Netezza, Greenplum, Vertica, Aster Data, and the like offering significant innovations in in-memory processing, exploiting parallelism, columnar storage options, and more. We are already starting to see hybrid approaches between the Hadoop players and the ADBMS players, and even the largest vendors like Oracle with their Exadata offering are excited enough to make significant investments in this space. Additionally, significant opportunities to push application processing into the databases themselves are emerging. There has never been such a plethora of choices available, as new entrants to the market seem to crop up weekly. Visionary applications of this technology in areas like meteorological forecasting and genomic sequencing with massive data volumes will become possible at hitherto unimaginable price points.

• Advanced Visualization will continue to increase in depth and relevance to broader audiences. Visionary vendors like Tableau, QlikTech, and Spotfire (now Tibco) made their mark by providing significantly differentiated visualization capabilities compared with the trite bar and pie charts of most Business Intelligence players’ reporting tools. The latest advances in state-of-the-art user interface technologies such as Microsoft’s Silverlight, Adobe Flex, and AJAX via frameworks like Google’s Web Toolkit augur the era of a revolution in state-of-the-art visualization capabilities. With consumers broadly aware of the power of capabilities like Google Maps or the tactile manipulations possible on the iPhone, these capabilities will find their way into enterprise offerings at a rapid speed, lest the gap between the consumer and enterprise realms become too large and lead to large scale adoption revolts as a younger generation begins to enter the workforce having never known the green screens of yore.

• Open Source offerings will continue to make in-roads against on-premise offerings. Much as cloud Business Intelligence offerings are doing, Open Source offerings in the larger Business Intelligence market are disrupting the incumbent, closed-source, on-premise vendors. Vendors like Pentaho and JasperSoft are really starting to hit their stride with growth percentages well above the industry average, offering complete end-to-end Business Intelligence stacks at a fraction of the cost of their competitors and thus seeing good bottom-up adoption rates. This is no doubt a function of the brutal economic times companies find themselves experiencing. Individual parts of the stacks can also be assembled into compelling offerings and receive valuable innovations from both corporate entities and dedicated committers: Revolution Computing’s R for statistical manipulation, Cloudera’s enterprise Hadoop for massive data, EsperTech for CEP, Talend for Data Integration / Data Quality / MDM, JFreeChart for charting, Mondrian and Jedox’s Palo for OLAP servers, Actuate’s BIRT for reporting, DynamoBI’s LucidDB for ADBMS, and the list goes on. These offerings have absolutely reached a level of maturity where they are capable of being deployed in the enterprise right alongside any other commercial closed-source vendor offering.

• Data Quality, Data Integration, and Data Virtualization will merge with Master Data Management to form a unified Information Management Platform for structured and unstructured data. Data quality has been the bane of information systems for as long as they have existed, causing many an IT analyst to obsess over it, and data quality issues contribute to significant losses in system adoption, productivity, and time spent addressing them. Increasingly, data quality and data integration will be interlocked hand-in-hand to ensure the right, cleansed data is moved to downstream sources by attacking the problem at its root. Vendors including SAP Business Objects, SAS, Informatica, and Talend are all providing these capabilities to some degree today. Of course, with the amount of relevant data sources exploding in the enterprise and no way to integrate all the data sources into a single physical location while maintaining agility, vendors like Composite Software are providing data virtualization capabilities, whereby canonical information models can be overlaid on top of information assets regardless of where the data are located, capable of addressing the federation of batch, real-time and event data sources. These disparate data sources will need to be harmonized by strong Master Data Management capabilities, whereby the definitions of key entities in the enterprise like customers, suppliers, products, etc. can be used to provide semantic unification over these distributed data sources. Finally, structured, semi-structured, and unstructured information will all be able to be extracted, transformed, loaded, and queried from this ubiquitous information management platform by leveraging text analytics capabilities that continue to grow in importance and combining them with data virtualization capabilities.

• Excel will continue to provide the dominant paradigm for end-user Business Intelligence consumption. For Excel specifically, the number one analytic tool by far with a home on hundreds of millions of personal desktops, Microsoft has invested significantly in ensuring its continued viability as we move past its second decade of existence, and its adoption shows absolutely no sign of abating any time soon. With Excel 2010's arrival, this includes significantly enhanced charting capabilities, a server-based mode first released in 2007 called Excel Services, being a first-class citizen in SharePoint, and the biggest disruptor, the launch of Power Pivot, an extremely fast, scalable, in-memory analytic engine that can allow Excel analysis on millions of rows of data at sub-second speeds. While many vendors have tried in vain to displace Excel from the desktops of the business user for more than two decades, none will be any closer to succeeding any time soon. Microsoft will continue to make sure of that.

Success Stories of Implementing Data Mining in Businesses

Acquiring New Customers via Data Mining

The first step in CRM is to identify prospects and convert them to customers. For example, look at how data mining can help manage the costs and improve the effectiveness of a customer acquisition campaign. Big Bank and Credit Card Company (BB&CC) annually conducts 25 direct mail campaigns, each of which offers one million people the opportunity for a credit card. The conversion rate measures the proportion of people who become credit card customers, which for BB&CC is about 1% per campaign.

Getting people to fill out an application for the credit card is only the first step. BB&CC must then decide whether the applicant is a good risk and accept them as a customer. Not surprisingly, poor credit risks are more likely to accept the offer than are good credit risks. So while 6% of the people on the mailing list respond with an application, only about 16% of those are suitable credit risks, for a net of about 1% of the mailing list becoming customers.

BB&CC's experience of a 6% response rate means that within the million names are 60,000 people who will respond to the solicitation. Unless BB&CC changes the nature of the solicitation (using different mailing lists, reaching customers in different ways, altering the terms of the offer), it is not going to get more than 60,000 responses. And of those 60,000 responses, only 10,000 will be good enough risks to become customers. The challenge BB&CC faces is getting to those 10,000 people most efficiently.

The cost of mailing the solicitations is about $1.00 per piece, for a total cost of $1,000,000. Over the next couple of years, these customers will generate about $1,250,000 in profit for the bank (or about $125 each), for a net return from the mailing of $250,000.

Data mining can improve this return. Although it won't precisely identify the 10,000 eventual credit card customers, it will help focus marketing efforts much more cost-effectively. First, BB&CC did a test mailing of 50,000 and carefully analyzed the results, building a predictive model of who would respond (using a decision tree) and a credit scoring model (using a neural net). It then combined these two models to find the people who were both good credit risks and most likely to respond to the offer.

The model was applied to the remaining 950,000 people in the mailing list, from which 700,000 people were selected for the mailing. The result was that from the 750,000 pieces mailed overall (including the test mailing), 9,000 acceptable applications for credit cards were received. In other words, the response rate had risen from 1% to 1.2%, a 20% increase. While the targeted mailing reaches only 9,000 of the 10,000 prospects (no model is perfect), reaching the remaining 1,000 prospects is not profitable. Had BB&CC mailed the other 250,000 people on the mailing list, the cost of $250,000 would have resulted in another $125,000 of gross profit, for a net loss of $125,000. The following table summarizes the results.

Items                          Old           New           Difference
Number of pieces mailed        1,000,000     750,000       (250,000)
Cost of mailing                $1,000,000    $750,000      ($250,000)
Number of responses            10,000        9,000         (1,000)
Gross profit per response      $125          $125          $0
Gross profit                   $1,250,000    $1,125,000    ($125,000)
Net profit                     $250,000      $375,000      $125,000
Cost of model                  $0            $40,000       $40,000
Final profit                   $250,000      $335,000      $85,000

Table 15: Cost Sheet of Mailing System Table {Source: Computer Reseller News (CRN) Magazines}

Notice that the net profit from the mailing increased by $125,000. Even when the $40,000 cost of the data mining software, computer, and people resources used for this modeling effort is included, the net profit increased by $85,000. This translates to a return on investment for modeling of over 200%, which far exceeded BB&CC's Return on Investment (ROI) requirements for this project.
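The profit and ROI figures quoted for the BB&CC campaign can be re-derived from the quantities stated in the text (pieces mailed, accepted customers, $1.00 per piece, $125 per customer, $40,000 modeling cost). The Python sketch below only repeats that arithmetic; the function and variable names are illustrative assumptions and are not part of the original study.

```python
def campaign_profit(pieces_mailed, customers, cost_per_piece=1.00,
                    profit_per_customer=125, model_cost=0):
    """Return (gross profit, net profit, final profit) for one mailing."""
    gross = customers * profit_per_customer
    net = gross - pieces_mailed * cost_per_piece   # subtract mailing cost
    final = net - model_cost                       # subtract modeling cost
    return gross, net, final

old = campaign_profit(1_000_000, 10_000)                  # untargeted mailing
new = campaign_profit(750_000, 9_000, model_cost=40_000)  # model-targeted mailing

extra_net = new[1] - old[1]      # gain before paying for the model
extra_final = new[2] - old[2]    # gain after paying for the model
roi = extra_final / 40_000       # return on the modeling investment

print(old, new)
print(extra_net, extra_final, f"{roi:.1%}")
```

With these inputs the gain after modeling costs is $85,000 on a $40,000 investment, which is where the "over 200%" figure comes from.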

Increasing the Value of Existing Customers: Cross-Selling via Data Mining

Guns and Roses (G&R) is a company that specializes in selling antique mortars and cannons as outdoor flower pots. G&R also offers a line of indoor flower pots made from large caliber antique pistols and a collection of muskets that have been converted to unique holders of long-stemmed flowers. Their catalog is sent to about 12 million homes.

When a customer calls in to place an order, G&R identifies the caller using caller ID when possible; otherwise G&R asks for a phone number or customer number from the catalog mailing label. Next, G&R looks up the customer in the database and then proceeds to take the order. At this point G&R has an excellent chance of selling the caller something additional, which is cross-selling. But G&R had found that if the first suggestion fails and it tries to suggest a second item, the customer may get irritated and hang up without ordering anything. And there are some customers who resent any attempt at all to cross-sell them on additional products.

Before trying data mining, G&R had been reluctant to cross-sell at all. Without the model, the odds of making the right recommendation were one in three. And because making any recommendation is unacceptable for some customers, G&R wanted to be exceptionally sure that it never made a recommendation when it should not. In a trial campaign, G&R had less than a 1% sales rate and a substantial number of complaints. G&R was reluctant to continue for such a small gain.

The situation changed dramatically with the use of data mining. Now the data mining model operates on the data. Using the customer information in the database and the new order, it tells the customer service representative what to recommend. G&R successfully sold 2% of the customers an additional product with virtually no complaints.

Developing this capability involved a process similar to solving the credit card customer acquisition problem. As with that situation, two models were needed. The first model predicted whether someone would be offended by recommendations. G&R found out how their customers would react by conducting a very short telephone survey. To be conservative, G&R counted anyone who declined to participate in the survey as someone who would find recommendations intrusive. Later on, to verify this assumption, G&R made recommendations to a small but statistically significant subset of those who had refused to answer the survey questions. To their surprise, G&R found that the assumption was not warranted. This enabled them to make more recommendations and further increase profits.

The second model predicted which offer would be most acceptable. In summary, data mining helped G&R better understand their customers' needs. When the data mining models were incorporated in a typical cross-selling CRM campaign, the models helped G&R increase its profitability by 2%.

Increasing the Value of Existing Customers: Personalization via Data Mining

Big Sam's Clothing Company has set up a website to supplement their catalog. Whenever customers go to the site, it greets them with "Howdy Pardner!", but once customers have ordered or registered, the website greets them by name. If customers have a record of ordering from Big Sam's, the website will also tell them about any new products that might be of particular interest. When customers look at a particular product, such as a waterproof down parka, Big Sam's will suggest other things that might supplement such a purchase.

When Big Sam's first put up the site, there was none of this personalization. It was just an on-line version of their catalog, nicely and efficiently done but not taking advantage of the sales opportunities presented by the Web. Data mining greatly increased the sales at their website.

Catalogs frequently group products by type to simplify the user's task of selecting products. In an on-line store, however, the product groups may be quite different, often based on complementing the item under consideration. In particular, the site can take into account not only the items that the customers are looking at, but what is in the customer's shopping cart as well, thus leading to even more customized recommendations.

First, Big Sam's used clustering to discover which products grouped together naturally. Some of the clusters were obvious, such as shirts and pants. Others were surprising, such as books about desert hiking and snakebite kits. Big Sam's used these groupings to make recommendations whenever someone looked at a product.

Big Sam's then built a customer profile to help identify those customers who would be interested in the new products that were always being added to the catalog. Big Sam's found that steering people to these selected products not only resulted in significant incremental sales, but also solidified their relationship with the customer. Surveys established that Big Sam's was viewed as a trusted advisor for clothing and gear.

To extend their reach further, Big Sam's started a program through which customers could elect to receive e-mail about new products that the data mining models predicted would interest them. While the customers viewed this as another example of proactive customer service, Big Sam's found it to be a program of profit improvement. The effort in personalization paid off for Big Sam's with significant, measurable increases in repeat sales, average number of sales per customer, and average size of a sale.

Retaining Good Customers via Data Mining

For almost every company, the cost of acquiring a new customer exceeds the cost of keeping good customers. This was the challenge facing Know Service, an Internet Service Provider (ISP) whose attrition rate was the industry average, 8% per month. Since Know Service has one million customers, this means 80,000 customers left each month. The cost to replace these customers is $200 each, or $16,000,000, which is plenty of incentive to start an attrition management program.

The first thing Know Service needed to do was prepare the data for predicting which customers would leave. Know Service needed to select the variables from their customer database and perhaps transform them. The bulk of their users were dial-in clients (as opposed to clients who are always connected through a T1 or DSL line), so Know Service knew how long each user was connected to the Web. Know Service also knew the volume of data transferred to and from a user's computer, the number of e-mail accounts a user had, the number of e-mail messages sent and received, and a customer's service and billing history. In addition, Know Service had demographic data that customers provided at sign-up.

Next, Know Service needed to identify who were "good" customers. This is not a data mining question but a business definition (such as profitability) followed by a calculation. Know Service built a model to profile their profitable customers and their unprofitable customers. Know Service used this model not only for customer retention but to identify customers who were not yet profitable but might become so in the future.

Know Service then built a model to predict who among their profitable customers would leave. As in most data mining problems, determining what data to use and how to combine existing data is where much of the challenge lies in model development. For example, Know Service needed to look at time-series data such as the monthly usage. Rather than use the raw time-series data, Know Service smoothed it by taking rolling three-month averages. Know Service also calculated the change in the three-month average and tried that as a predictor.

Some of the factors that were good predictors, such as declining usage, were symptoms rather than causes that could be directly addressed. Other predictors, such as the average number of service calls and the change in the average number of service calls, were indicative of customer satisfaction problems worth investigating. Predicting who would churn, however, wasn't enough. Based on the results of their modeling, Know Service identified some potential programs and offers that it believed would entice people to stay.
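The smoothing step described above (a rolling three-month average of monthly usage, plus the change in that average as a candidate predictor) can be sketched in a few lines of Python. The usage figures and function names below are hypothetical and chosen only for illustration; the source does not give the actual data or implementation.

```python
def rolling_mean(values, window=3):
    """Average of each run of `window` consecutive values."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

def changes(values):
    """Difference between each value and the one before it."""
    return [b - a for a, b in zip(values, values[1:])]

# Hypothetical monthly connection hours for one subscriber.
monthly_hours = [30, 33, 27, 24, 18, 12]

smoothed = rolling_mean(monthly_hours)   # three-month rolling averages
trend = changes(smoothed)                # month-to-month change in the average

print(smoothed)
print(trend)
```

A steadily negative trend in the smoothed series is the kind of "declining usage" signal the text mentions as a churn predictor.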

For example, some churners were exceeding even the largest amount of usage available for a fixed fee and were paying substantial incremental usage fees. Know Service tried offering these users a higher fee service that included more bundled time. Some users were offered more free disk space for personal web pages. Know Service then built models that would predict which would be the effective offer for a particular user.

To summarize, the churn project made use of all three models. One model identified likely churners, the next model picked out the profitable ones worth keeping, and the third model matched the potential churners with the most appropriate offer. The net result was a reduction in their churn rate from 8% to 7.5%, for a savings in customer acquisition costs of $1,000,000 per month. Know Service found that their investment in data mining paid off by improving their customer relationships and dramatically increasing their profitability.
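The monthly savings figure follows directly from the numbers given for Know Service: one million customers, a $200 replacement cost, and a churn rate that fell from 8% to 7.5%. A minimal Python check of that arithmetic (names are illustrative assumptions):

```python
customers = 1_000_000
replacement_cost = 200          # dollars to replace one lost customer

def monthly_replacement_cost(churn_rate):
    """Cost of replacing the customers lost in one month."""
    lost = round(customers * churn_rate)
    return lost * replacement_cost

before = monthly_replacement_cost(0.08)    # 80,000 customers lost
after = monthly_replacement_cost(0.075)    # 75,000 customers lost
savings = before - after

print(before, after, savings)
```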

Conclusion

Data mining can be beneficial for businesses, governments, and society as well as the individual person. However, the major flaw with data mining is that it increases the risk of privacy invasion. Currently, business organizations do not have sufficient security systems to protect the information obtained through data mining from unauthorized access, and so the use of data mining should be restricted. In the future, when companies are willing to spend money to develop sufficient security systems to protect consumer data, the use of data mining may be supported.

Comprehensive data warehouses that integrate operational data with customer, supplier, and market information have resulted in an explosion of information. Competition requires timely and sophisticated analysis on an integrated view of the data. However, there is a growing gap between more powerful storage and retrieval systems and the users' ability to effectively analyze and act on the information they contain. Both relational and OLAP technologies have tremendous capabilities for navigating massive data warehouses, but brute force navigation of data is not enough. A new technological leap is needed to structure and prioritize information for specific end-user problems. Data mining tools can make this leap. Quantifiable business benefits have been proven through the integration of data mining with current information systems, and new products are on the horizon that will bring this integration to an even wider audience of users.

Data mining has been gaining tremendous interest, and hence research on data mining has mushroomed within the last few decades. A promising approach for managing complex information and user-defined data types is to incorporate object-orientation concepts into Relational Database Management Systems. In this research, I have presented an approach for the design of an Object-Oriented Database and for performing classification effectively in it. Object-oriented programming concepts such as inheritance and polymorphism have been utilized in the presented approach. Owing to this design of the Object-Oriented Database (OODB), an efficient classification task has been achieved by utilizing simple SQL/Oracle queries. The experimental results have demonstrated the effectiveness of the presented approach. This approach will successfully reduce the implementation overhead incurred in the design of an OODB. My approach will reduce the amount of memory space required for storing databases that grow in size.

Customer Relationship Management is essential to compete effectively in today's marketplace. The more effectively a business analyst can use the information about customers to meet their needs, the more profitable the business will be. But operational CRM needs analytical CRM with predictive data mining models at its core. The route to a successful business requires that the business understand its customers and their requirements, and data mining is the essential guide.

In spite of the often uncanny accuracy of insight that data mining provides, it is not magic. It's a valuable business tool that organizations around the globe are successfully using to make critical business decisions about customer acquisition and retention, customer value management, marketing optimization, and other customer-related issues. Similarly, the keys to effectively using data mining are not secret or mysterious. With a solid understanding of the issues to be addressed, appropriate resources and support, and the right solution, the business analyst, too, can experience the business benefits that other organizations are reaping from data mining.

APPENDIX – I List of Figures

Figure 1: The Database System
Figure 2: Data Mining is the core of Knowledge Discovery Process
Figure 3: Fragments of some relations from a relational database for VideoStore
Figure 4: A multi-dimensional data cube structure commonly used in data for data warehousing
Figure 5: Summarized data from VideoStore before and after drilldown and roll-up operations
Figure 6: Fragment of a transaction database for the rentals at VideoStore
Figure 7: Visualization of spatial OLAP (from GeoMiner system)
Figure 8: Examples of Time-Series Data (Source: Thompson Investors Group)
Figure 9: Class Structure of Employees, Suppliers and Customers Table
Figure 10: Inheritance Hierarchy of Classes in the Proposed OODB Design
Figure 11: Graph Demonstrating the above Evaluation Results
Figure 12: Decision Tree
Figure 13: Generalization \ Specialization

APPENDIX – II List of Tables

Table 1: Example of Employees Table
Table 2: Example of Customers Table
Table 3: Example of Suppliers Table
Table 4: Example of Persons Table
Table 5: Example of Extended Employees Table
Table 6: Example of Extended Suppliers Table
Table 7: Example of Extended Customers Table
Table 8: Example of Extended Places Table
Table 9: Example of Extended PostalCodes Table
Table 10: Saved Memory Table {Source: Computer Reseller News (CRN) Magazines}
Table 11: Saved Memory Table {Source: Computer Reseller News (CRN) Magazines}
Table 12: Saved Memory Table {Source: Computer Reseller News (CRN) Magazines}
Table 13: Saved Memory Table {Source: Computer Reseller News (CRN) Magazines}
Table 14: Saved Memory Table {Source: Computer Reseller News (CRN) Magazines}
Table 15: Cost Sheet of Mailing System Table {Source: Computer Reseller News (CRN) Magazines}

APPENDIX – III List of Equations

Equation 1: To Determine the Memory Size

APPENDIX – IV SQL Queries

Create Table Statement

Create Table PostalCodes

    Create type PostalCodes as Object(
        PostalCode number(6),
        City       Varchar2(15),
        Region     Varchar2(15),
        Country    Varchar2(15));

Create Table Places

    Create type Places as Object(
        PlaceID    Varchar2(15),
        Street     number(6),
        PostalCode PostalCodes);

Create Table Persons

    Create type Persons as Object(
        PersonID      number(6),
        Name          varchar2(15),
        BirthDate     date,
        Age           number(3),
        Gender        varchar2(6),
        MaritalStatus varchar2(9),
        Place         Places,
        phone         number(10));

Create Table Employees

    Create type Employees as Object(
        EmployeeID      Persons,
        title           varchar2(10),
        TitleofCourtesy Varchar2(4),
        Hiredate        date,
        Extension       date);

Create Table Suppliers

    Create type Suppliers as Object(
        SupplierID   Persons,
        CompanyName  varchar2(20),
        ContactTitle varchar2(15),
        Fax          number(10),
        HomePage     varchar2(20));

Create Table Customers

    Create type Customers as Object(
        CustomerID   Persons,
        CompanyName  varchar2(20),
        ContactTitle varchar2(15),
        Fax          number(10));

APPENDIX – V Abbreviation and Synonyms

RDBMS: Relational Database Management Systems
OODB: Object-Oriented Database
OOP: Object-Oriented Programming
OODBMS: Object-Oriented Database Management Systems
OOPL: Object-Oriented Programming Language
VRML: Virtual Reality Markup Language / Virtual Reality Modeling Language
NASA: National Aeronautics and Space Administration (USA)
CAD: Computer Aided Design
CAM: Computer Aided Manufacturing

CASE: Computer Aided Software Engineering
ROI: Return on Investment
CRISP-DM: Cross-Industry Standard Process for Data Mining
CRM: Customer Relationship Management
BB&CC: Big Bank and Credit Card Company
G&R: Guns and Roses Company
ISP: Internet Service Provider
BI: Business Intelligence

Repository: A repository is a collection of resources that can be accessed to retrieve information. Repositories often consist of several databases tied together by a common search engine.

Customer Relationship Management: Customer Relationship Management is a broadly recognized, widely-implemented strategy for managing and nurturing a company's interactions with clients and sales prospects.

Noise Data: Noise data is meaningless data. The term has often been used as a synonym for corrupt data. However, its meaning has expanded to include any data that cannot be understood and interpreted correctly by machines, such as unstructured text. Any data that has been received, stored, or changed in such a manner that it cannot be read or used by the program that originally created it can be described as noisy. Noise data unnecessarily increases the amount of storage space required and can also adversely affect the results of any data mining analysis. Statistical analysis can use information gleaned from historical data to weed out noisy data and facilitate data mining. Noise data can be caused by hardware failures, programming errors and gibberish input from speech or optical character recognition (OCR) programs. Spelling errors, industry abbreviations and slang can also impede machine reading.

Web Mining: Web Mining is the application of data mining techniques to discover patterns from the Web. According to analysis targets, Web Mining can be divided into three different types, which are Web usage mining, Web content mining and Web structure mining.

Mobile Computing: Mobile Computing is a generic term describing one's ability to use technology while moving, as opposed to portable computers, which are only practical for use while deployed in a stationary configuration.

Semantics: Semantics is the study of meaning, usually in language. The word "semantics" itself denotes a range of ideas, from the popular to the highly technical. It is often used in ordinary language to denote a problem of understanding that comes down to word selection or connotation.

Niche: In ecology, a niche is a term describing the relational position of a species or population in its ecosystem to each other; e.g. a dolphin could potentially be in another ecological niche from one that travels in a different pod if the members of these pods utilize significantly different ...

Nearest Neighbor Method: Nearest Neighbor search (NNS), also known as proximity search, similarity search or closest point search, is an optimization problem for finding closest points in metric spaces. The problem is: given a set S of points in a metric space M and a query point q belonging to M, find the closest point in S to q.
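The nearest neighbor problem stated in this glossary (given a set S of points and a query point q, find the point in S closest to q) can be illustrated with a simple linear scan. This is a minimal Python sketch assuming Euclidean distance; the sample points are hypothetical, and practical systems often use index structures rather than a full scan.

```python
import math

def nearest_neighbor(points, query):
    """Linear scan: return the point in `points` closest to `query`."""
    return min(points, key=lambda p: math.dist(p, query))

S = [(1.0, 1.0), (4.0, 5.0), (9.0, 2.0)]   # hypothetical point set
q = (5.0, 4.0)                              # hypothetical query point

closest = nearest_neighbor(S, q)
print(closest)
```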

Specialization: An entity set may include subgroupings of entities that are distinct in some way from other entities in the set. For instance, a subset of entities within an entity set may have attributes that are not shared by all the entities in the entity set. The E-R model provides a means for representing these distinctive entity groupings. Consider an entity set person, with attributes name, street, and city. A person may be further classified as one of the following:

1. Customer
2. Employee

Each of these person types is described by a set of attributes that includes all the attributes of entity set person plus possibly additional attributes. For example, customer entities may be described further by the attribute customer-id, whereas employee entities may be described further by the attributes employee-id and salary. The process of designating subgroupings within an entity set is called specialization. The specialization of person allows us to distinguish among persons according to whether they are employees or customers.

Generalization: The design process may also proceed in a bottom-up manner, in which multiple entity sets are synthesized into a higher-level entity set on the basis of common features. The database designer may have first identified a customer entity set with the attributes name, street, city, and customer-id, and an employee entity set with the attributes name, street, city, employee-id, and salary. There are similarities between the customer entity set and the employee entity set in the sense that they have several attributes in common. This commonality can be expressed by generalization, which is a containment relationship that exists between a higher-level entity set and one or more lower-level entity sets. In our example, person is the higher-level entity set and customer and employee are lower-level entity sets. Higher- and lower-level entity sets also may be designated by the terms superclass and subclass, respectively. The person entity set is the superclass of the customer and employee subclasses.

For all practical purposes, generalization is a simple inversion of specialization. We will apply both processes, in combination, in the course of designing the E-R schema for an enterprise. In terms of the E-R diagram itself, we do not distinguish between specialization and generalization. New levels of entity representation will be distinguished (specialization) or synthesized (generalization) as the design schema comes to express fully the database application and the user requirements of the database.

Differences in the two approaches may be characterized by their starting point and overall goal. Generalization proceeds from the recognition that a number of entity sets share some common features (namely, they are described by the same attributes and participate in the same relationship sets).

Figure 13: Generalization \ Specialization
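The person, customer and employee example maps naturally onto a superclass with two subclasses. The Python sketch below is one possible illustration, not the report's own OODB design: the attribute names follow the text (name, street, city, customer-id, employee-id, salary), while the sample values are invented.

```python
from dataclasses import dataclass

@dataclass
class Person:               # higher-level entity set (superclass)
    name: str
    street: str
    city: str

@dataclass
class Customer(Person):     # lower-level entity set: adds customer-id
    customer_id: str

@dataclass
class Employee(Person):     # lower-level entity set: adds employee-id, salary
    employee_id: str
    salary: float

# Hypothetical sample entities.
c = Customer("Asha", "Main St", "Mumbai", "C-101")
e = Employee("Ravi", "Hill Rd", "Pune", "E-7", 52000.0)

# Every customer and employee is also a person and inherits its attributes.
print(isinstance(c, Person), isinstance(e, Person), e.city)
```

Reading the hierarchy top-down (Person split into Customer and Employee) corresponds to specialization; reading it bottom-up (common attributes factored into Person) corresponds to generalization.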

Method Override: Method overriding, in object-oriented programming, is a language feature that allows a subclass to provide a specific implementation of a method that is already provided by one of its superclasses. The implementation in the subclass overrides (replaces) the implementation in the superclass. For Example:

    class Base {
    public:
        // Virtual so that subclasses may replace the implementation.
        virtual void DoSomething() { x = x + 5; }
    private:
        int x = 0;   // initialized so the addition is well defined
    };

    class Derived : public Base {
    public:
        // Overrides Base::DoSomething, then calls the base version explicitly.
        virtual void DoSomething() {
            y = y + 5;
            Base::DoSomething();
        }
    private:
        int y = 0;
    };

Method Overload: Method overloading allows us to write different versions of the same method in a class or derived class. The compiler automatically selects the most appropriate method based on the parameters supplied. For Example:

    public class MultiplyNumbers {
        // Two-argument version.
        public int Multiply(int a, int b) { return a * b; }

        // Three-argument version with the same name.
        public int Multiply(int a, int b, int c) { return a * b * c; }
    }

    MultiplyNumbers mn = new MultiplyNumbers();
    int number = mn.Multiply(2, 3);      // result = 6
    int number1 = mn.Multiply(2, 3, 4);  // result = 24

Polymorphic: Relating to polymorphism (any sense); able to have several shapes or forms. In computer science, polymorphism is a programming language feature that allows values of different data types to be handled using a uniform interface. The concept of parametric polymorphism applies to both data types and functions.

Interface: Interface generally refers to an abstraction that an entity provides of itself to the outside. An interface in computer science refers to a set of named operations that can be invoked by clients. An interface in the Java programming language is an abstract type that is used to specify an interface (in the generic sense of the term) that classes must implement.

Inheritance: Inheritance is the practice of passing on property, titles, debts, and obligations upon the death of an individual. It has long played an important role in human societies. The rules of inheritance differ between societies and have changed over time. In object-oriented programming (OOP), inheritance is a way to form new classes (instances of which are called objects) using classes that have already been defined. Inheritance is employed to help reuse existing code with little or no modification.

Composite Object Modeling: In programming languages, composite objects are usually expressed by means of references from one object to another; depending on the language, such references may be known as fields, members, properties or attributes, and the resulting composition as a structure, storage record, tuple, user-defined type (UDT), or composite type. Fields are given a unique name so that each one can be distinguished from the others. However, having such references doesn't necessarily mean that an object is a composite. It is only called composite if the objects it refers to are really its parts, i.e. have no independent existence.

Decision trees: A decision tree (or tree diagram) is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. Decision trees are a way of representing a series of rules that lead to a class or value. For example, a business may wish to offer a prospective customer a particular product. The figure shows a simple decision tree that solves this problem while illustrating all the basic components of a decision tree: the decision node, branches and leaves.

A simple classification tree.

The first component is the top decision node, or root node, which specifies a test to be carried out. Each branch will lead either to another decision node or to the bottom of the tree, called a leaf node. By navigating the decision tree, a business analyst can assign a value or class to a case by deciding which branch to take, starting at the root node and moving to each subsequent node until a leaf node is reached. Each node uses the data from the case to choose the appropriate branch.

Decision tree models are commonly used in data mining to examine the data and induce a tree and its rules that will be used to make predictions. A number of different algorithms may be used for building decision trees, including CHAID (Chi-squared Automatic Interaction Detection), CART (Classification and Regression Trees), Quest, and C5.0.
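The navigation described above (start at the root, apply each node's test to the case, follow the matching branch until a leaf is reached) can be sketched as follows. The tree, its attributes and its thresholds are hypothetical and hand-written for illustration; they are not the tree from the figure, and no tree-induction algorithm such as CART is implemented here.

```python
# A node is either a leaf (a plain string holding the class) or a
# decision node: (attribute, threshold, branch_if_below, branch_otherwise).
tree = ("income", 40_000,
        "no offer",
        ("years_as_customer", 2, "no offer", "make offer"))

def classify(node, case):
    """Walk from the root to a leaf, testing one attribute per node."""
    while not isinstance(node, str):
        attribute, threshold, below, otherwise = node
        node = below if case[attribute] < threshold else otherwise
    return node

prospect = {"income": 55_000, "years_as_customer": 5}   # hypothetical case
decision = classify(tree, prospect)
print(decision)
```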

Reluctant: Not eager; unwilling; disinclined.

Warranted: Authorization or certification; sanction, as given by a superior.

Attrition Rate: The rate of shrinkage in size or number.

Solicitation: To seek to obtain by persuasion, entreaty, or formal application: a candidate who solicited votes among the factory workers.

Clustering: Clustering divides a database into different groups. The goal of clustering is to find groups that are very different from each other, and whose members are very similar to each other. Unlike classification, marketers don't know what the clusters will be when they start, or by which attributes the data will be clustered. Consequently, someone who is knowledgeable in the business must interpret the clusters. After marketers have found clusters that reasonably segment the database, these clusters may then be used to classify new data. Some of the common algorithms used to perform clustering include Kohonen feature maps and K-means. Don't confuse clustering with segmentation. Segmentation refers to the general problem of identifying groups that have common characteristics. Clustering is a way to segment data into groups that are not previously defined, whereas classification is a way to segment data by assigning it to groups that are already defined.
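K-means, one of the clustering algorithms named above, alternates between assigning each record to its nearest cluster centre and moving each centre to the mean of its assigned records. The Python sketch below is a bare-bones illustration under stated assumptions: the customer data are invented, the two starting centroids are hand-picked rather than chosen by any initialization scheme, and a fixed number of iterations is used instead of a convergence test.

```python
import math

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroid          # keep an empty cluster's centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical customers: (orders per year, average order value).
data = [(1, 20), (2, 25), (1, 30), (10, 200), (12, 220), (11, 210)]
centroids, clusters = k_means(data, centroids=[(0, 0), (15, 250)])

print(centroids)
print(clusters)
```

As the glossary notes, the algorithm only produces the groups; someone who knows the business still has to interpret what each cluster means.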

Neural Networks: A computer architecture in which processors are connected in a manner suggestive of connections between neurons; such a network can learn by trial and error. (Actual biological neural networks are incomparably more complex.) Neural networks are of particular interest because they offer a means of efficiently modeling large and complex problems in which there may be hundreds of predictor variables that have many interactions. Neural nets are most commonly used for regressions but may also be used in classification problems.
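Learning by trial and error can be illustrated with a single artificial neuron (a perceptron), which is far simpler than the networks described here. In this sketch the training data (the logical AND of two inputs), the learning rate and the number of passes are all assumptions chosen for illustration.

```python
def predict(weights, bias, inputs):
    """Fire (1) if the weighted sum of the inputs plus the bias is positive."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, epochs=20, rate=1.0):
    """Adjust the weights a little after every wrong guess."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)   # 0 when correct
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


# Illustrative training set: output is 1 only when both inputs are 1.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_SAMPLES)
for inputs, target in AND_SAMPLES:
    print(inputs, predict(weights, bias, inputs), "expected", target)
```

A real neural network connects many such units in layers and can model regression as well as classification problems; this sketch only shows the error-driven weight adjustment.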
