Data Warehouse, OLAP and Data Mining


Published by: eumine on Sep 21, 2008
Copyright:Attribution Non-commercial







Data Warehouse & Data Mining


- Part 1: Data Warehouses
- Part 2: OLAP
- Part 3: Data Mining


Part 1: Data Warehouses


Data, Data everywhere yet ...
- I can't find the data I need
  - data is scattered over the network
  - many versions, subtle differences
- I can't get the data I need
  - need an expert to get the data
- I can't understand the data I found
  - available data poorly documented
- I can't use the data I found
  - results are unexpected
  - data needs to be transformed from one form to another


What is a Data Warehouse?
A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context. [Barry Devlin]

Why Data Warehousing?
- Which are our lowest/highest margin customers?
- What is the most effective distribution channel?
- Who are my customers and what products are they buying?
- What product promotions have the biggest impact on revenue?
- What impact will new products/services have on revenue and margins?
- Which customers are most likely to go to the competition?


Decision Support
- Used to manage and control the business
- Data is historical or point-in-time
- Optimized for inquiry rather than update
- Used by managers and end-users to understand the business and make judgements

The Evolution of Data Warehousing
- Since the 1970s, organizations have gained competitive advantage through systems that automate business processes to offer more efficient and cost-effective services to the customer.
- This resulted in the accumulation of growing amounts of data in operational databases.

Data Warehousing Concepts
- A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process (Inmon, 1993).


Subject-oriented Data
- The warehouse is organized around the major subjects of the enterprise (e.g. customers, products, and sales) rather than the major application areas (e.g. customer invoicing, stock control, and product sales).
- This is reflected in the need to store decision-support data rather than application-oriented data.


Integrated Data
- The data warehouse integrates corporate application-oriented data from different source systems, which often include data that is inconsistent.
- The integrated data source must be made consistent to present a unified view of the data to the users.

Time-variant Data
- Data in the warehouse is only accurate and valid at some point in time or over some time interval.
- Time-variance is also shown in the extended time that the data is held, the implicit or explicit association of time with all data, and the fact that the data represents a series of snapshots.

Non-volatile Data
- Data in the warehouse is not updated in real-time but is refreshed from operational systems on a regular basis.
- New data is always added as a supplement to the database, rather than a replacement.
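
The time-variant and non-volatile properties can be illustrated with a minimal sketch. It assumes a hypothetical in-memory table and a hypothetical `load_snapshot` helper; none of these names come from the slides. Each periodic refresh appends a dated snapshot instead of overwriting earlier rows.

```python
from datetime import date

# Hypothetical in-memory warehouse table: one row per (customer, snapshot date).
warehouse = []

def load_snapshot(snapshot_date, operational_rows):
    """Append the current operational state as a new dated snapshot."""
    for row in operational_rows:
        warehouse.append({"as_of": snapshot_date, **row})

# The operational system keeps only the current balance and overwrites it;
# the warehouse receives one snapshot per refresh and never updates old rows.
load_snapshot(date(2004, 1, 31), [{"customer": "C1", "balance": 100}])
load_snapshot(date(2004, 2, 29), [{"customer": "C1", "balance": 250}])

# Both snapshots survive, so the history of C1 can still be queried.
history = [(r["as_of"], r["balance"]) for r in warehouse if r["customer"] == "C1"]
print(history)
```

Because rows are only ever appended, a query "as of" any past refresh date returns the same answer no matter how many later refreshes have run.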

Benefits of Data Warehousing
- Potential high returns on investment
- Competitive advantage
- Increased productivity of corporate decision-makers

Comparison of OLTP Systems and Data Warehousing


Data Warehouse Queries
- The types of queries that a data warehouse is expected to answer range from the relatively simple to the highly complex and depend on the type of end-user access tools used.
- End-user access tools include:
  - Reporting, query, and application development tools
  - Executive information systems (EIS)
  - OLAP tools


Examples of Typical Data Warehouse Queries
- What was the total revenue for Scotland in the third quarter of 2004?
- What was the total revenue for property sales for each type of property in Great Britain in 2003?
- What are the three most popular areas in each city for the renting of property in 2004 and how does this compare with the figures for the previous two years?
- What is the monthly revenue for property sales at each branch office, compared with rolling 12-monthly prior figures?
- What would be the effect on property sales in the different regions of Britain if legal costs went up by 3.5% and Government taxes went down by 1.5% for properties over £100,000?
- Which type of property sells for prices above the average selling price for properties in the main cities of Great Britain and how does this correlate to demographic data?
- What is the relationship between the total annual revenue generated by each branch office and the total number of sales staff assigned to each branch office?
© Pearson Education Limited 1995, 2005
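
As a rough illustration, the first query in the list could be expressed in SQL as follows. The sketch uses Python's built-in `sqlite3` module. The table name `property_sale`, its columns, and the sample rows are assumptions made up for the example, not a schema from the slides.

```python
import sqlite3

# Hypothetical fact table; names and figures are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE property_sale (region TEXT, sale_date TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO property_sale VALUES (?, ?, ?)",
    [
        ("Scotland", "2004-07-15", 120000.0),
        ("Scotland", "2004-09-30", 80000.0),
        ("Scotland", "2004-10-01", 95000.0),  # fourth quarter, excluded
        ("Wales", "2004-08-20", 70000.0),     # other region, excluded
    ],
)

# Total revenue for Scotland in the third quarter of 2004 (July to September).
total_q3_scotland = conn.execute(
    """
    SELECT COALESCE(SUM(revenue), 0)
    FROM property_sale
    WHERE region = 'Scotland'
      AND sale_date >= '2004-07-01' AND sale_date < '2004-10-01'
    """
).fetchone()[0]
print(total_q3_scotland)  # 200000.0
```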




Problems of Data Warehousing
- Underestimation of resources for data loading
- Hidden problems with source systems
- Required data not captured
- Increased end-user demands
- Data homogenization
- High demand for resources
- Data ownership
- High maintenance
- Long duration projects
- Complexity of integration

Typical Architecture of a Data Warehouse



Data Mart
- A subset of a data warehouse that supports the requirements of a particular department or business function.
- Characteristics include:
  - Focuses on only the requirements of one department or business function.
  - Does not normally contain detailed operational data, unlike data warehouses.
  - More easily understood and navigated.

Reasons for Creating a Data Mart
- To give users access to the data they need to analyze most often.
- To provide data in a form that matches the collective view of the data by a group of users in a department or business function area.
- To improve end-user response time due to the reduction in the volume of data to be accessed.
- To provide appropriately structured data as dictated by the requirements of the end-user access tools.
- Building a data mart is simpler compared with establishing a corporate data warehouse.
- The cost of implementing data marts is normally less than that required to establish a data warehouse.
- The potential users of a data mart are more clearly defined and can be more easily targeted to obtain support for a data mart project rather than a corporate data warehouse project.


From the Data Warehouse to Data Marts
[Figure: three levels of structure (organizationally structured data warehouse, departmentally structured, individually structured) shown against a Less-to-More scale labelled History, Normalized, Detailed.]


Part 2: OLAP


Nature of OLAP Analysis
- Aggregation: total sales, percent-to-total
- Comparison: budget vs. expenses
- Ranking: top 10, quartile analysis
- Access to detailed and aggregate data
- Complex criteria specification
- Visualization
- Need interactive response to aggregate queries
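
The first three kinds of analysis can be sketched in a few lines of plain Python. The product and department figures below are invented for illustration.

```python
# Hypothetical sales per product and budget/expense figures per department.
sales = {"Cola": 400.0, "Juice": 300.0, "Milk": 200.0, "Soap": 100.0}
budget = {"Marketing": 500.0, "IT": 300.0}
expenses = {"Marketing": 450.0, "IT": 340.0}

# Aggregation: total sales and each product's percent-to-total.
total = sum(sales.values())
percent_to_total = {p: 100.0 * v / total for p, v in sales.items()}

# Comparison: budget vs. expenses (positive means under budget).
variance = {d: budget[d] - expenses[d] for d in budget}

# Ranking: top 2 products by sales.
top_2 = sorted(sales, key=sales.get, reverse=True)[:2]

print(total, percent_to_total["Cola"], variance, top_2)
```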

Business Intelligence Technologies: OLAP & Data Mining
- Accompanying the growth in data warehousing is an ever-increasing demand by users for more powerful access tools that provide advanced analytical capabilities.
- There are two main types of access tools available to meet this demand, namely Online Analytical Processing (OLAP) and data mining.
- OLAP and data mining differ in what they offer the user, and because of this they are complementary technologies.
- An environment that includes a data warehouse (or more commonly one or more data marts) together with tools such as OLAP and/or data mining is collectively referred to as Business Intelligence.

Online Analytical Processing (OLAP)
- The dynamic synthesis, analysis, and consolidation of large volumes of multi-dimensional data (Codd, 1993).
- Describes a technology that uses a multi-dimensional view of aggregate data to provide quick access to strategic information for the purposes of advanced analysis.
- Enables users to gain a deeper understanding and knowledge about various aspects of their corporate data through fast, consistent, interactive access to a wide variety of possible views of the data.
- Allows users to view corporate data in such a way that it is a better model of the true dimensionality of the enterprise.
- Can easily answer 'who?' and 'what?' questions; however, the ability to answer 'what if?' and 'why?' type questions distinguishes OLAP from general-purpose query tools.
- Types of analysis range from basic navigation and browsing (slicing and dicing), to calculations, to more complex analyses such as time series and complex modeling.

Examples of OLAP Applications in Various Functional Areas


OLAP Applications
- Although OLAP applications are found in widely divergent functional areas, they all have the following key features:
  - multi-dimensional views of data
  - support for complex calculations
  - time intelligence


OLAP Applications - Support for Complex Calculations
- Must provide a range of powerful computational methods, such as those required by sales forecasting, which uses trend algorithms such as moving averages and percentage growth.
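
The two trend calculations named above can be sketched as follows. The function names and the monthly figures are illustrative assumptions. A real OLAP tool would expose these as built-in calculations rather than user code.

```python
def moving_average(values, window):
    """Simple trailing moving average; one output per full window."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

def percentage_growth(values):
    """Period-over-period growth in percent, starting from the second period.

    Assumes no period has a value of zero (division by the previous period).
    """
    return [100.0 * (cur - prev) / prev for prev, cur in zip(values, values[1:])]

# Hypothetical monthly sales figures.
monthly_sales = [100.0, 110.0, 121.0, 133.0]
print(moving_average(monthly_sales, 3))
print(percentage_growth(monthly_sales))
```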


OLAP Applications - Time Intelligence
- Key feature of almost any analytical application, as performance is almost always judged over time.
- Time hierarchy is not always used in the same manner as other hierarchies.
- Concepts such as year-to-date and period-over-period comparisons should be easily defined.
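
Year-to-date and period-over-period comparisons can be sketched with running totals. The revenue figures are invented, and the comparison shown is "same month, previous year", which is only one of several possible period-over-period definitions.

```python
from itertools import accumulate

# Hypothetical monthly revenue, January to April, for two years.
revenue = {
    2003: [10.0, 12.0, 11.0, 15.0],
    2004: [12.0, 15.0, 14.0, 18.0],
}

# Year-to-date: running total within each year.
ytd = {year: list(accumulate(months)) for year, months in revenue.items()}

# Period-over-period: each 2004 month compared with the same month in 2003.
yoy_change = [cur - prev for prev, cur in zip(revenue[2003], revenue[2004])]

print(ytd[2004])    # [12.0, 27.0, 41.0, 59.0]
print(yoy_change)   # [2.0, 3.0, 3.0, 3.0]
```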


OLAP Benefits
- Increased productivity of end-users.
- Reduced backlog of applications development for IT staff.
- Retention of organizational control over the integrity of corporate data.
- Reduced query drag and network traffic on OLTP systems or on the data warehouse.
- Improved potential revenue and profitability.

Representation of Multi-dimensional Data
- Example of a two-dimensional query:
  - 'What is the total revenue generated by property sales in each city, in each quarter of 2004?'
- Choice of representation is based on the types of queries the end-user may ask.
- Compare representation: three-field relational table versus two-dimensional matrix.
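
The two representations can be compared with a small sketch. The same facts are held first as three-field rows (city, quarter, revenue) and then pivoted into a two-dimensional matrix. The cities and revenue figures are placeholders, not the values from the original figure.

```python
# Three-field relational form: (city, quarter, revenue). Values are hypothetical.
rows = [
    ("Glasgow", "Q1", 100), ("Glasgow", "Q2", 120),
    ("London", "Q1", 200), ("London", "Q2", 210),
]

# Two-dimensional matrix form: one row per city, one column per quarter.
cities = sorted({city for city, _, _ in rows})
quarters = sorted({quarter for _, quarter, _ in rows})
matrix = [[0] * len(quarters) for _ in cities]
for city, quarter, revenue in rows:
    matrix[cities.index(city)][quarters.index(quarter)] += revenue

print(cities, quarters)
print(matrix)  # [[100, 120], [200, 210]]
```

In the matrix form, "revenue per city" is a row sum and "revenue per quarter" is a column sum, whereas the relational form needs a grouping pass over all rows for either question.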

Multi-dimensional Data as Three-field Table versus Two-dimensional Matrix


Representation of Multi-dimensional Data
- Example of a three-dimensional query:
  - 'What is the total revenue generated by property sales for each type of property (Flat or House) in each city, in each quarter of 2004?'
- Compare representation: four-field relational table versus three-dimensional cube.

Multi-dimensional Data as Four-field Table versus Three-dimensional Cube


Representation of Multi-dimensional Data
- Cube represents data as cells in an array.
- Relational table only represents multi-dimensional data in two dimensions.
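
A cube can be approximated in a few lines by storing each cell under its coordinate tuple. The sketch below uses a dictionary rather than a dense array, and the helper `slice_cube` is a hypothetical name. Fixing one dimension to a single value gives the "slice" operation.

```python
from collections import defaultdict

# Four-field relational form: (property type, city, quarter, revenue).
# All values are hypothetical.
rows = [
    ("Flat", "Glasgow", "Q1", 50), ("House", "Glasgow", "Q1", 70),
    ("Flat", "London", "Q1", 90), ("House", "London", "Q2", 110),
]

# Cube form: each cell is addressed by its (type, city, quarter) coordinates.
cube = defaultdict(int)
for ptype, city, quarter, revenue in rows:
    cube[(ptype, city, quarter)] += revenue

def slice_cube(cube, axis, value):
    """Fix one dimension to a single value and drop it, leaving a 2-D slice."""
    return {
        key[:axis] + key[axis + 1:]: cell
        for key, cell in cube.items()
        if key[axis] == value
    }

q1 = slice_cube(cube, 2, "Q1")
print(q1)
```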


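Multi-dimensional Data
- Measure: sales (actual, plan, variance)
- Dimensions: Product, Region, Time
[Figure: data cube with axes Region (W, S, N), Product (Juice, Cola, Milk, Cream, Toothpaste, Soap) and Month (1 to 7).]

Hierarchical summarization paths
[Figure: summarization hierarchy for each dimension. Surviving level labels: Product (Industry, ...); Region (Country, ..., City, Office); Time (Year, ..., Month, Day).]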
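
Summarizing along a hierarchy (often called roll-up) can be sketched for the Time dimension. The example assumes facts keyed by ISO date strings, so that truncating the key moves from Day to Month to Year. The helper name `roll_up` and the figures are illustrative.

```python
from collections import defaultdict

# Hypothetical daily sales keyed by ISO date (YYYY-MM-DD).
daily_sales = {"2004-01-05": 10, "2004-01-20": 15, "2004-02-03": 7, "2005-01-09": 4}

def roll_up(facts, key_length):
    """Aggregate ISO-date keyed facts to a coarser level by truncating the key."""
    totals = defaultdict(int)
    for day, amount in facts.items():
        totals[day[:key_length]] += amount
    return dict(totals)

monthly = roll_up(daily_sales, 7)  # Day -> Month (YYYY-MM)
yearly = roll_up(daily_sales, 4)   # Day -> Year (YYYY)
print(monthly)  # {'2004-01': 25, '2004-02': 7, '2005-01': 4}
print(yearly)   # {'2004': 32, '2005': 4}
```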



Strengths of OLAP
- It is a powerful visualization tool
- It provides fast, interactive response times
- It is good for analyzing time series
- It can be useful for finding some clusters and outliers
- Many vendors offer OLAP tools


OLAP and Executive Information Systems
- Andyne Computing: Pablo
- Arbor Software: Essbase
- Cognos: PowerPlay
- Comshare: Commander OLAP
- Holistic Systems: Holos
- Information Advantage: AXSYS, WebOLAP
- Informix: Metacube
- Microstrategies: DSS/Agent
- Oracle: Express
- Pilot: LightShip
- Planning Sciences: Gentium
- Platinum Technology: ProdeaBeacon, Forest & Trees
- SAS Institute: SAS/EIS, OLAP++
- Speedware: Media


Part 3: Data Mining


Data Mining
- The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions (Simoudis, 1996).
- Involves the analysis of data and the use of software techniques for finding hidden and unexpected patterns and relationships in sets of data.
- Reveals information that is hidden and unexpected, as there is little value in finding patterns and relationships that are already intuitive.
- Patterns and relationships are identified by examining the underlying rules and features in the data.
- Most accurate results normally require large volumes of data to deliver reliable conclusions.
- Starts by developing an optimal representation of the structure of sample data.
- Data mining can provide huge paybacks for companies that have made a significant investment in data warehousing.
- Relatively new technology, but already used in a number of industries.

Examples of Applications of Data Mining
- Retail / Marketing
  - Identifying buying patterns of customers
  - Finding associations among customer demographic characteristics
  - Predicting response to mailing campaigns
  - Market basket analysis
- Banking
  - Detecting patterns of fraudulent credit card use
  - Identifying loyal customers
  - Predicting customers likely to change their credit card affiliation
  - Determining credit card spending by customer groups
- Insurance
  - Claims analysis
  - Predicting which customers will buy new policies
- Medicine
  - Characterizing patient behavior to predict surgery visits
  - Identifying successful medical therapies for different illnesses

Data Mining Operations
- Four main operations include:
  - Predictive modeling
  - Database segmentation
  - Link analysis
  - Deviation detection
- There are recognized associations between the applications and the corresponding operations.
  - e.g. Direct marketing strategies use database segmentation.

Data Mining Techniques
- Techniques are specific implementations of the data mining operations.
- Each operation has its own strengths and weaknesses.


Data Mining Operations and Associated Techniques


Predictive Modeling
- Similar to the human learning experience
  - uses observations to form a model of the important characteristics of some phenomenon.
- Uses generalizations of the 'real world' and the ability to fit new data into a general framework.
- Can analyze a database to determine essential characteristics (model) about the data set.
- Model is developed using a supervised learning approach, which has two phases: training and testing.
  - Training builds a model using a large sample of historical data called a training set.
  - Testing involves trying out the model on new, previously unseen data to determine its accuracy and physical performance characteristics.
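
The two phases can be sketched with a deliberately tiny model: a single threshold on one attribute. The records, the attribute `years_renting`, and the helpers `train` and `accuracy` are all assumptions for illustration. A real training set would be far larger, as the slide notes.

```python
# Hypothetical records: (years_renting, bought_property).
training_set = [(1, False), (2, False), (3, True), (5, True), (6, True), (1, False)]
test_set = [(0, False), (4, True), (2, True), (7, True)]

def train(records):
    """Training: pick the threshold that misclassifies the fewest training records.

    The model predicts 'buys' when years_renting >= threshold.
    """
    candidates = sorted({x for x, _ in records})

    def errors(threshold):
        return sum((x >= threshold) != label for x, label in records)

    return min(candidates, key=errors)

def accuracy(threshold, records):
    """Testing: fraction of records the threshold model classifies correctly."""
    return sum((x >= threshold) == label for x, label in records) / len(records)

threshold = train(training_set)
test_accuracy = accuracy(threshold, test_set)
print(threshold, test_accuracy)  # 3 0.75
```

The model fits the training set perfectly but gets one of the four unseen records wrong, which is why accuracy is measured on data that was not used for training.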

Predictive Modeling
- Applications of predictive modeling include customer retention management, credit approval, cross selling, and direct marketing.
- There are two techniques associated with predictive modeling: classification and value prediction, which are distinguished by the nature of the variable being predicted.

Example of Classification using Tree Induction


Predictive Modeling - Value Prediction
- Used to estimate a continuous numeric value that is associated with a database record.
- Uses the traditional statistical techniques of linear regression and nonlinear regression.
- Relatively easy to use and understand.
- Linear regression attempts to fit a straight line through a plot of the data, such that the line is the best representation of the average of all observations at that point in the plot.
- The problem is that the technique only works well with linear data and is sensitive to the presence of outliers (that is, data values that do not conform to the expected norm).
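
The sensitivity to outliers can be seen with a small ordinary least squares fit. The function `fit_line` and the data points are illustrative. The first data set lies exactly on y = 2x, and the second changes only the last value.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
clean = [2.0, 4.0, 6.0, 8.0, 10.0]          # exactly y = 2x
with_outlier = [2.0, 4.0, 6.0, 8.0, 40.0]   # last value is an outlier

clean_slope, clean_intercept = fit_line(xs, clean)
outlier_slope, outlier_intercept = fit_line(xs, with_outlier)
print(clean_slope, clean_intercept)      # slope 2, intercept 0
print(outlier_slope, outlier_intercept)  # slope 8, intercept -12
```

A single extreme value moves the fitted slope from 2 to 8, so the line no longer describes the other four observations well.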

Predictive Modeling - Value Prediction
- Data mining requires statistical methods that can accommodate non-linearity, outliers, and non-numeric data.
- Applications of value prediction include credit card fraud detection or target mailing list identification.

Database Segmentation
- Aim is to partition a database into an unknown number of segments, or clusters, of similar records.
- Uses unsupervised learning to discover homogeneous subpopulations in a database to improve the accuracy of the profiles.
- Less precise than other operations, thus less sensitive to redundant and irrelevant features.
- Applications of database segmentation include customer profiling, direct marketing, and cross selling.
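
One simple way to segment records without fixing the number of clusters in advance is single-pass leader clustering, sketched below. This technique is chosen here only for illustration and is not named in the slides. The `radius` parameter and the customer records are assumptions.

```python
def leader_clustering(points, radius):
    """Single-pass leader clustering.

    A point joins the first cluster whose leader (its first member) lies
    within `radius`; otherwise it starts a new cluster. The number of
    clusters is therefore discovered from the data, not given up front.
    """
    clusters = []
    for p in points:
        for cluster in clusters:
            leader = cluster[0]
            distance = ((p[0] - leader[0]) ** 2 + (p[1] - leader[1]) ** 2) ** 0.5
            if distance <= radius:
                cluster.append(p)
                break
        else:
            clusters.append([p])
    return clusters

# Hypothetical customers as (age, annual spend in thousands).
customers = [(25, 5), (27, 6), (60, 30), (62, 29), (26, 4)]
segments = leader_clustering(customers, radius=5.0)
print(len(segments), segments)
```

The result depends on the order of the records and on the chosen radius, which is one reason segmentation is described above as less precise than other operations.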

Example of Database Segmentation using a Scatterplot


Link Analysis
- Aims to establish links (associations) between records, or sets of records, in a database.
- There are three specializations:
  - Associations discovery
  - Sequential pattern discovery
  - Similar time sequence discovery
- Applications include product affinity analysis, direct marketing, and stock price movement.


Link Analysis - Associations Discovery
- Finds items that imply the presence of other items in the same event.
- Affinities between items are represented by association rules.
  - e.g. 'When a customer rents property for more than 2 years and is more than 25 years old, in 40% of cases, the customer will buy a property. This association happens in 35% of all customers who rent properties.'
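
The two percentages in such a rule are commonly called confidence (how often the consequent holds when the antecedent holds) and support (how often the whole rule occurs across all records). The sketch below assumes that convention. The function `rule_metrics` and the baskets are made up for illustration.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(antecedent <= t for t in transactions)
    n_both = sum((antecedent | consequent) <= t for t in transactions)
    support = n_both / len(transactions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
    {"bread", "milk"},
]
support, confidence = rule_metrics(baskets, {"bread"}, {"butter"})
print(support, confidence)  # 0.4 0.5
```

Here the rule "bread implies butter" appears in 2 of 5 baskets (support 40%) and holds in 2 of the 4 baskets that contain bread (confidence 50%).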

Link Analysis - Sequential Pattern Discovery
- Finds patterns between events such that the presence of one set of items is followed by another set of items in a database of events over a period of time.
  - e.g. Used to understand long-term customer buying behavior.
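
A minimal sketch of the idea is to count how often one event is followed, at any later point, by another event in each customer's history. The event names and the helper `followed_by` are hypothetical.

```python
def followed_by(sequences, first, second):
    """Fraction of sequences in which `first` occurs and `second` occurs later."""
    def matches(seq):
        return first in seq and second in seq[seq.index(first) + 1:]
    return sum(matches(s) for s in sequences) / len(sequences)

# Hypothetical per-customer event histories, in time order.
histories = [
    ["rent_flat", "buy_flat", "buy_house"],
    ["rent_flat", "rent_house"],
    ["buy_flat", "rent_flat"],   # wrong order, does not match
    ["rent_flat", "buy_flat"],
]
share = followed_by(histories, "rent_flat", "buy_flat")
print(share)  # 0.5
```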

Link Analysis - Similar Time Sequence Discovery
- Finds links between two sets of data that are time-dependent, and is based on the degree of similarity between the patterns that both time series demonstrate.
  - e.g. Within three months of buying property, new home owners will purchase goods such as cookers, freezers, and washing machines.
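
One possible measure of similarity between two time series is the Pearson correlation, optionally with one series shifted by a lag. The slides do not say which measure is used, so this choice is an assumption, as are the function names and the monthly figures.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length series with non-zero variance."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def lagged_similarity(a, b, lag=0):
    """Correlation between series a and series b observed `lag` periods later."""
    if lag:
        a, b = a[:-lag], b[lag:]
    return pearson(a, b)

# Hypothetical monthly counts: appliance sales echo property sales one month later.
property_sales = [10, 12, 15, 11, 9, 10]
appliance_sales = [7, 20, 24, 30, 22, 18]

same_month = lagged_similarity(property_sales, appliance_sales)
one_month_later = lagged_similarity(property_sales, appliance_sales, lag=1)
print(same_month, one_month_later)
```

With a one-month lag the two series line up exactly, so the correlation is 1. Without the lag the similarity is weaker.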

Deviation Detection
- Relatively new operation in terms of commercially available data mining tools.
- Often a source of true discovery because it identifies outliers, which express deviation from some previously known expectation and norm.
- Can be performed using statistics and visualization techniques or as a by-product of data mining.
- Applications include fraud detection in the use of credit cards and insurance claims, quality control, and defects tracing.


Example of Database Segmentation using a Visualization

