Lesson Map
[Figure: lesson map; recoverable topics: Why Data Warehouse?, Definition, Subject oriented, Data mining]
Learning Outcome
• Describe the purpose of a data warehouse.
• Describe the characteristics of a data warehouse.
• Explain the relationship between a data warehouse and an operational database.
• Explain the architecture of a data warehouse.
• Explain the related technologies of a data warehouse.
Why Data Warehouse?

Operational database
• Supports day-to-day business operations
• Stores operational data
• Competitive advantage through efficient and cost-effective services to the customer
• Typical question: How much revenue was generated by each customer during this year?

Data warehouse
• Stores business data
• Data are structured to help decision makers make strategic decisions
• Typical questions: What was the revenue for each quarter of this year by geographic region and customer? What are expected sales by region next year?
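The warehouse-style question above (revenue per quarter by region and customer) can be sketched as a grouped query. This is only an illustration: the `sales` table, its column names, and the sample figures are assumptions, not part of the lesson, and SQLite stands in for whatever DBMS the warehouse actually uses.

```python
import sqlite3

# Hypothetical sample data: (sale_date, region, customer, amount).
ROWS = [
    ("2024-01-15", "North", "Acme", 100.0),
    ("2024-02-10", "North", "Acme", 50.0),
    ("2024-04-05", "North", "Acme", 70.0),
    ("2024-05-20", "South", "Bolt", 30.0),
    ("2024-06-01", "South", "Bolt", 20.0),
]


def revenue_by_quarter(rows):
    """Return (quarter, region, customer, revenue) tuples, summed per group."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE sales (sale_date TEXT, region TEXT, customer TEXT, amount REAL)"
    )
    con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    # Months 1-3 map to quarter 1, 4-6 to quarter 2, and so on
    # (integer division, because both operands are integers).
    query = """
        SELECT (CAST(strftime('%m', sale_date) AS INTEGER) + 2) / 3 AS quarter,
               region, customer, SUM(amount) AS revenue
        FROM sales
        GROUP BY quarter, region, customer
        ORDER BY quarter, region, customer
    """
    result = con.execute(query).fetchall()
    con.close()
    return result
```

Calling `revenue_by_quarter(ROWS)` gives one row per quarter, region, and customer, which is the summarized, decision-oriented view the slide contrasts with day-to-day operational lookups.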
Introduction

Definition
A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decisions.
Inmon, W.H., Building the Data Warehouse, Wellesley, MA: QED Tech. Pub. Group, 1992

[Figure labels: data prepared, organized, presented]
CHARACTERISTICS … subject-oriented …
• The data in the warehouse is defined and organized in business terms, and is grouped under business-oriented subject headings, such as
– customers
– products
– sales
rather than application-oriented data.
[Figure labels: Savings Account System, Investment Account System, Checking Account System]
CHARACTERISTICS … integrated …
• The data warehouse contents are defined such that they are valid across the enterprise and its operational and external data sources.
• The data in the warehouse should be
– clean
– validated
– properly integrated
[Figure labels: Operational systems, Data warehouse]
CHARACTERISTICS … time-variant …
• All data in the data warehouse is timestamped at time of entry into the warehouse or when it is summarized within the warehouse.
• This chronological recording of data provides historical and trend analysis possibilities.
• On the contrary, operational data is overwritten, since past values are not of interest.
CHARACTERISTICS … nonvolatile …
• Once loaded into the data warehouse, the data is not updated.
• Data acts as a stable resource for consistent reporting and comparative analysis.
• On the contrary, operational data is updated (inserted, deleted, modified).
[Figure labels: Insert, Change, Replace, Load, Access]
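The contrast between overwritten operational data and timestamped, load-only warehouse data can be sketched with two small classes. The class and method names are hypothetical and chosen only for illustration; a real warehouse would enforce this in its load process rather than in application code.

```python
from datetime import datetime, timezone


class OperationalStore:
    """Keeps only the current value per key; an update overwrites the past."""

    def __init__(self):
        self.current = {}

    def update(self, key, value):
        self.current[key] = value


class WarehouseStore:
    """Load-only store: every load appends a timestamped row, nothing is changed."""

    def __init__(self):
        self.rows = []

    def load(self, key, value, loaded_at=None):
        # Time-variant: each row carries the time it entered the warehouse.
        stamp = loaded_at or datetime.now(timezone.utc)
        self.rows.append((key, value, stamp))

    def history(self, key):
        """Return (value, timestamp) pairs for one key, in load order."""
        return [(value, stamp) for k, value, stamp in self.rows if k == key]
```

After two updates the operational store holds only the latest balance, while the warehouse store still returns both values with their load times, which is what makes trend analysis possible.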
An Example of Data Integration

Operational data
• Checking Account System: Jane Doe (name), Female (gender), Bounced check #145 on 1/5/95, Opened account 1994
• Savings Account System: Jane Doe, F (gender), Opened account 1992
• Investment Account System: Jane Doe, Owns 25 Shares Exxon, Opened account 1995

Warehouse data
• Customer: Jane Doe, Female, Bounced check #145, Married, Owns 25 Shares Exxon, Customer since 1992
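A minimal sketch of the integration step for the Jane Doe example follows. The record layout, the `integrate` function, and the gender code table are assumptions made for illustration; the sketch reconciles inconsistent gender codes and takes the earliest account opening as the "customer since" year. It does not try to derive attributes, such as marital status, that none of the three source records shown carries.

```python
# Hypothetical mapping from source-system codes to one warehouse-wide value.
GENDER_CODES = {"F": "Female", "FEMALE": "Female", "M": "Male", "MALE": "Male"}

SOURCES = [
    {"system": "checking", "name": "Jane Doe", "gender": "Female",
     "opened": 1994, "facts": ["Bounced check #145"]},
    {"system": "savings", "name": "Jane Doe", "gender": "F", "opened": 1992},
    {"system": "investment", "name": "Jane Doe", "opened": 1995,
     "facts": ["Owns 25 Shares Exxon"]},
]


def integrate(records):
    """Merge per-system records for one customer into a single warehouse row."""
    merged = {"name": None, "gender": None, "customer_since": None, "facts": []}
    for rec in records:
        merged["name"] = merged["name"] or rec["name"]
        if rec.get("gender"):
            # Normalize "F" and "Female" to the same value.
            merged["gender"] = GENDER_CODES[rec["gender"].upper()]
        opened = rec["opened"]
        if merged["customer_since"] is None or opened < merged["customer_since"]:
            merged["customer_since"] = opened  # earliest relationship wins
        merged["facts"].extend(rec.get("facts", []))
    return merged
```

For `SOURCES`, the merged row reports the gender as "Female" and the customer-since year as 1992, matching the warehouse view on the slide.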
An Architecture for Data Warehousing
[Figure: operational databases and external sources pass through extraction, cleaning, validation, and summarization into the data warehouse (with metadata and a data mart); the warehouse is used by OLAP, data mining, and query tools serving USER1, USER2, and USER3]
On-Line Analytical Processing (OLAP)
• Term introduced by E.F. Codd (1993) in contrast to On-Line Transaction Processing (OLTP)
• The OLAP Council’s definition: “A category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that have been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user”
On-Line Analytical Processing (OLAP)
• Basic idea: users should be able to manipulate enterprise data models across many dimensions to understand changes that are occurring.
• Data used in OLAP should be in the form of a multi-dimensional cube.
[Figure: data cube; recoverable axis labels: Market, Product]
Dimensional Hierarchies
• Each dimension can be hierarchically structured
– Year → Month → Week → Day
– Country → State → City → Store
– Type of product → Product → Item
OLAP Operations
• Rollup: decreasing the level of detail
• Drill-down: increasing the level of detail [time: year → month → week → day]
• Slice-and-dice: selection and projection
• Pivot: re-orienting the multidimensional view of data, for example sales [product, market, time] → sales [market, product, time]
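The four operations can be tried on a tiny in-memory cube. This is a sketch under assumptions: the fact rows, the dimension names in `DIMS`, and the helper names `aggregate` and `slice_` are all hypothetical, and a real OLAP server would precompute or cache such aggregates.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, market, year, month, sales).
DIMS = ("product", "market", "year", "month")
FACTS = [
    ("tea", "east", 2024, 1, 10),
    ("tea", "east", 2024, 2, 15),
    ("tea", "west", 2024, 1, 5),
    ("coffee", "east", 2024, 1, 20),
    ("coffee", "west", 2025, 1, 30),
]


def aggregate(facts, keep):
    """Sum sales over every dimension that is not named in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def slice_(facts, dim, value):
    """Slice: keep only the rows where one dimension has a fixed value."""
    i = DIMS.index(dim)
    return [row for row in facts if row[i] == value]


monthly = aggregate(FACTS, ("product", "year", "month"))  # detailed view
yearly = aggregate(FACTS, ("product", "year"))            # rollup: month to year
east_only = aggregate(slice_(FACTS, "market", "east"), ("product",))  # slice
pivoted = aggregate(FACTS, ("market", "product"))         # pivot: market first
```

Going from `yearly` back to `monthly` is the drill-down direction; `pivoted` holds the same totals as a product-by-market view, only keyed with market first.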
Implementing Multidimensionality
• Multi-dimensional databases (MDDB)
• To make relational databases handle multidimensionality, two kinds of tables are introduced:
– Fact table: contains numerical facts. It is long and thin.
– Dimension tables: contain pointers to the fact table. They show where the information can be found. A separate table is provided for each dimension. Dimension tables are small, short, and wide.
Star Schema
• Fact Table: STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price
• Market Dimension: STORE KEY, Store Desc., City, State, District ID, District Desc., Region ID, Region Desc., Regional Mgr., Level
• Time Dimension: PERIOD KEY, Period Desc., Year, Quarter, Month, Day
• Product Dimension: PRODUCT KEY, Product Desc., Brand, Color, Size, Manufacturer
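A cut-down version of this star schema can be built in SQLite to show how the fact table joins to its dimensions. The table names, the reduced column set, and the sample rows are assumptions for illustration; only the key and measure names follow the slide.

```python
import sqlite3

# Reduced star schema: one fact table pointing at three dimension tables.
SCHEMA = """
CREATE TABLE market_dim  (store_key INTEGER PRIMARY KEY, store_desc TEXT, city TEXT, state TEXT);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_desc TEXT, brand TEXT);
CREATE TABLE time_dim    (period_key INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE sales_fact (
    store_key   INTEGER REFERENCES market_dim(store_key),
    product_key INTEGER REFERENCES product_dim(product_key),
    period_key  INTEGER REFERENCES time_dim(period_key),
    dollars REAL, units INTEGER, price REAL
);
"""


def build_demo():
    """Create an in-memory database with a few hypothetical rows."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    con.executemany("INSERT INTO market_dim VALUES (?, ?, ?, ?)",
                    [(1, "Store A", "Austin", "TX"), (2, "Store B", "Boston", "MA")])
    con.executemany("INSERT INTO product_dim VALUES (?, ?, ?)",
                    [(1, "Green tea", "Leaf"), (2, "Espresso", "Bean")])
    con.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                    [(1, 2024, 1), (2, 2024, 2)])
    con.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?, ?)", [
        (1, 1, 1, 40.0, 10, 4.0),
        (1, 2, 1, 60.0, 20, 3.0),
        (2, 1, 2, 80.0, 20, 4.0),
    ])
    return con


def dollars_by_state_and_quarter(con):
    """Join the fact table to two dimensions and summarize the Dollars measure."""
    return con.execute("""
        SELECT m.state, t.quarter, SUM(f.dollars)
        FROM sales_fact AS f
        JOIN market_dim AS m ON m.store_key = f.store_key
        JOIN time_dim   AS t ON t.period_key = f.period_key
        GROUP BY m.state, t.quarter
        ORDER BY m.state, t.quarter
    """).fetchall()
```

The fact table stays narrow (keys plus measures) while the descriptive attributes live in the dimension tables, so a query picks up only the dimensions it needs.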
MOLAP, ROLAP, DSS
• The OLAP technology is considered an extension of the original DSS technology.
• DSS applications are tools that access and analyze data in relational database (RDB) tables.
• OLAP tools access and analyze multidimensional data (typically three, up to ten-dimensional data).
• OLAP technology is called MOLAP/ROLAP (multidimensional/relational OLAP) if it uses an MDDB/RDB.
OLAP/DSS
• OLAP tools focus on providing multidimensional data analysis, that is superior to SQL in computing summaries and breakdowns along many dimensions.
• OLAP tools require strong interaction from the users to identify interesting patterns in data.
• An OLAP tool evaluates a precise query that the user formulates.
• OLAP users are “farmers”.
OLAP Tools
• MOLAP: Essbase, MIS Alea (Systems Union) and TM1 (Applix Inc). There is also an open source MOLAP server Palo.
• ROLAP: Microsoft Analysis Services (Microsoft), MicroStrategy 8 (Microstrategy) and BusinessObjects XI (French BusinessObjects), Oracle BI (the former Siebel Analytics). There is also an open source ROLAP server Mondrian.
• HOLAP: Microsoft Analysis Services, MicroStrategy and SAP AG BI Accelerator.
Mondrian
• Mondrian is an Open Source OLAP (online analytical processing) server, written in the Java programming language. It supports the MDX (multidimensional expressions) query language and the XML for Analysis and JOLAP specifications. It reads from SQL and other data sources and aggregates data in a memory cache.
Mondrian is used for:
• High performance, interactive analysis of large or small volumes of information
• "Dimensional" exploration of data, for example analyzing sales by product line, by region, by time period
• Parsing of Multi-Dimensional eXpression (MDX) language into Structured Query Language (SQL) to retrieve answers to dimensional queries
• High-speed queries through the use of aggregate tables in the RDBMS
• Advanced calculations using the calculation expressions of the MDX language
http://www.cs.brown.edu/courses/cs227/Papers/Visualization/Choong.pdf
DBMS for Warehouse
• Multidimensional DBMS: Essbase, UniVerse
• Relational DBMS: Oracle, DB2, MySQL, SQL Server, Firebird, PostgreSQL
DATA MINING
Definition: The process of extracting valid, previously unknown, comprehensible and actionable information from large databases and using it to make crucial business decisions.
Data mining assists business analysts with finding patterns and relationships in the data; it does not tell you the value of the patterns to the organization.

Data mining applications
• Marketing
– Identifying buying patterns of customers
– Predicting response to mailing campaigns
• Banking
– Identifying loyal customers
– Determining credit card spending by customer groups
• Insurance
– Claims analysis
– Predicting which customers will buy new policies
Data mining operations and associated techniques

Operations              Data mining techniques
Predictive modeling     Classification, Value prediction
Database segmentation   Clustering
Link analysis           Association discovery, Sequential pattern discovery
Deviation detection     Statistics, Visualization
Classification
Classification is a data mining (machine learning) technique used to predict group membership for data instances. For example, you may wish to use classification to predict whether the weather on a particular day will be “sunny”, “rainy” or “cloudy”. Popular classification techniques include decision trees and neural networks.
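As a rough illustration of the weather example, the sketch below learns a decision tree with a single split: each value of one feature is mapped to the class seen most often with it. This is a deliberately reduced stand-in for a full decision-tree learner, and the humidity feature, training data, and function names are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical training data: (humidity level, observed weather).
TRAINING = [
    ("low", "sunny"), ("low", "sunny"), ("low", "cloudy"),
    ("medium", "cloudy"), ("medium", "cloudy"), ("medium", "sunny"),
    ("high", "rainy"), ("high", "rainy"), ("high", "cloudy"),
]


def train_one_level_tree(samples):
    """Learn a one-split tree: feature value -> majority class label."""
    by_value = defaultdict(Counter)
    for feature_value, label in samples:
        by_value[feature_value][label] += 1
    return {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}


def predict(tree, feature_value, default="cloudy"):
    """Predict group membership; fall back to `default` for unseen values."""
    return tree.get(feature_value, default)
```

With `tree = train_one_level_tree(TRAINING)`, a day with high humidity is assigned to the "rainy" group. Real classifiers split on several features and are evaluated on data they were not trained on.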
Clustering
• Clustering is a data mining (machine learning) technique used to place data elements into related groups without advance knowledge of the group definitions.
• Clustering divides a database into different groups. The goal of clustering is to find groups that are very different from each other, and whose members are very similar to each other. Example: splitting the customer database by age groupings, such as an old age group and a young age group.

Association discovery
Finds occurrences that are linked to a single event. Example: in a supermarket, customers who purchase beer also buy peanuts 55% of the time.

Sequential pattern discovery
Occurs where events are linked over time, that is, one event leading to another later event. Example: 65% of the time, the purchase of a house is followed by a purchase of curtains after two months.
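Percentages such as "beer and peanuts 55% of the time" are usually reported as the confidence of an association rule, alongside its support. The sketch below computes both for a handful of made-up baskets; the basket contents and function names are assumptions, and the resulting figures are not the ones quoted on the slide.

```python
# Hypothetical market baskets, one set of items per transaction.
BASKETS = [
    {"beer", "peanuts"},
    {"beer", "peanuts", "bread"},
    {"beer", "milk"},
    {"beer"},
    {"milk", "bread"},
]


def support(baskets, items):
    """Fraction of all baskets that contain every item in `items`."""
    wanted = set(items)
    return sum(1 for basket in baskets if wanted <= set(basket)) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Of the baskets containing `antecedent`, the fraction that also hold `consequent`."""
    ante = set(antecedent)
    both = ante | set(consequent)
    n_ante = sum(1 for basket in baskets if ante <= set(basket))
    n_both = sum(1 for basket in baskets if both <= set(basket))
    return n_both / n_ante if n_ante else 0.0
```

Here `confidence(BASKETS, ["beer"], ["peanuts"])` is the share of beer baskets that also contain peanuts. Sequential pattern discovery uses the same ratio but counts a later event in the same customer's history instead of a second item in the same basket.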
Value prediction - Forecasting
• Used to discover patterns in the data that can lead to predictions about the future. Example: projection of sales in the next 12 months

Deviation Detection
Deviation detection is often a source of true discovery because it identifies outliers, which express deviation from some previously known expectation and norm. Example: quality control
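Both ideas can be sketched in a few lines. The forecast below fits a straight line by least squares, which is only one simple choice of forecasting model and is an assumption here; the deviation check flags readings that fall outside a known expectation plus or minus a tolerance. The function names and sample readings are hypothetical.

```python
def forecast(history, steps):
    """Fit a least-squares line to `history` and extend it `steps` periods ahead."""
    n = len(history)
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    slope = numerator / denominator
    intercept = mean_y - slope * mean_x
    return [intercept + slope * (n + k) for k in range(steps)]


def find_deviations(measurements, expected, tolerance):
    """Return (index, value) pairs lying more than `tolerance` from `expected`."""
    return [(i, x) for i, x in enumerate(measurements)
            if abs(x - expected) > tolerance]


# Hypothetical quality-control readings for a part specified as 50.0 +/- 0.5.
READINGS = [50.1, 49.9, 50.0, 51.2, 50.2, 48.7]
```

For a 12-month sales projection, `forecast(monthly_sales, 12)` would extend the fitted trend, where `monthly_sales` is whatever history is available. `find_deviations(READINGS, 50.0, 0.5)` returns the readings a quality-control process would want to inspect.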
Data mining tools
• Enterprise Miner, SAS Institute [decision trees, association, linear model, time series]
• Intelligent Miner, IBM [decision trees, association, linear model, time series, sequential]
• Scenario, Cognos [decision trees]
Data Stream Mining

Data Streams vs DBMS
• Data streams: continuous, ordered, changing, fast, huge amount
• Traditional DBMS: data stored in finite, persistent data sets

Characteristics of Data Streams
• Huge volumes of continuous data, possibly infinite
• Fast changing and requires fast, real-time response
• Data stream captures nicely our data processing needs of today
• Random access is expensive: single scan algorithm (can only have one look)
• Store only the summary of the data seen thus far
• Most stream data are at pretty low-level or multi-dimensional in nature, needs multi-level and multi-dimensional processing

Goal: Mine patterns, process queries and compute statistics on data streams in real-time
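The "single scan, store only a summary" constraint can be illustrated with running statistics. The sketch uses Welford's online method, chosen here as one well-known single-pass technique rather than anything the slides prescribe; the class and function names are hypothetical.

```python
class RunningStats:
    """Single-scan summary of a stream: count, mean, and variance (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the current mean

    def add(self, x):
        """Fold one item into the summary; the item itself is not stored."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the items seen so far."""
        return self._m2 / self.count if self.count else 0.0


def summarize(stream):
    """Consume any iterable once and return its running summary."""
    stats = RunningStats()
    for x in stream:  # each item is looked at exactly once
        stats.add(x)
    return stats
```

Memory use stays constant no matter how long the stream runs, because only three numbers are kept. The same pattern extends to other summaries, such as counts per key or fixed-size samples.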
Applications of Stream Data Mining
What are the Applications?
• Telecommunication calling records
• Business: credit card transaction flows
• Network monitoring and traffic engineering
• Financial market: stock exchange
• Engineering & industrial processes: power supply & manufacturing
• Sensor, monitoring & surveillance: video streams, RFIDs
• Security monitoring
• Web logs and Web page click stream
Useful Websites
• Data warehouse
– http://en.wikipedia.org/wiki/Data_warehouse
– http://www.dwinfocenter.org/
– http://www.datawarehousingonline.com/
– http://www.1keydata.com/datawarehousing/datawarehouse.html
– http://www.1keydata.com/datawarehousing/concepts.html
– http://www.lc.leidenuniv.nl/awcourse/oracle/server.920/a96520/concept.htm
Useful Websites
• Data mining
– http://www.autonlab.org/tutorials/
– http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm
– http://www.thearling.com/text/dmwhite/dmwhite.htm
– http://www.eco.utexas.edu/~norman/BUS.FOR/course.mat/Alex/
– http://www.the-datamine.com/bin/view/Misc/DataMiningTutorials
Useful Websites - Data stream mining
• http://www.sigmod.org/sigmod/record/issues/0506/p18-survey-gaber.pdf
• http://www.csse.monash.edu.au/~mgaber/WResources.htm
• http://www.csse.monash.edu.au/~mgaber/CameraReadyPAKDD.pdf
• http://en.wikipedia.org/wiki/Data_stream_mining
• http://domino.research.ibm.com/comm/research.nsf/pages/r.kdd.innovation.html
• http://www.public.asu.edu/~huanliu/CFP/CFPMiningStreamData.html
• http://citeseer.ist.psu.edu/640620.html
Summary
• Introduction to data warehouse
• Characteristics of a data warehouse
• Data warehouse architecture
• OLAP and data mining

Next lesson
Data warehouse and data mining tools
Words of encouragement .