This action might not be possible to undo. Are you sure you want to continue?
Using a common set of attributes to determine which methodology to use in a particular data warehousing project.
DATA INTEGRATION TECHNOLOGIES
have experienced explosive growth in the last few years, and data warehousing has played a major role in the integration process. A data warehouse is a subjectoriented, integrated, time-variant, and nonvolatile collection of data that supports managerial decision making . Data warehousing has been cited as the highest-priority post-millennium project of more than half of IT executives. A large number of data warehousing methodologies and tools are available to support the growing market. However, with so many methodologies to choose from, a major concern for many firms is which one to employ in a given data warehousing project. In this article, we review and compare several prominent data warehousing methodologies based on a common set of attributes. Online transaction processing (OLTP) systems are useful for addressing the operational data needs of a firm. However, they are not well suited for supporting decision-support queries or business questions that managers typically need to address. Such questions involve analytics including aggregation, drilldown, and slicing/dicing of data, which are best supported by online analytical processing (OLAP) systems. Data warehouses support OLAP applications by storing and maintaining data in multidimensional format. Data in an OLAP warehouse is extracted and loaded from multiple OLTP data sources (including DB2, Oracle, IMS databases, and flat files) using Extract, Transfer, and Load (ETL) tools. The warehouse is located in a presentation server. It can span enterprisewide data needs or can be a collection of “conforming” data marts . Data marts (subsets of data warehouses) are conformed by following a standard set of attribute declarations called a data warehouse bus. The data warehouse uses a metadata repository to integrate all of its components. The metadata stores definitions of the source data, data models for target databases, and transformation rules that convert source data into target data. The concepts of time variance and nonvolatility are essential for a data warehouse . Inmon emphasized the importance of cross-functional slices of data drawn from multiple sources to support a diversity of needs ; the foundation of his subject-oriented design was an enterprise data model. Kimball introduced the notion of dimensional modeling , which addresses the gap between relational databases and
A Comparison of
By Arun Sen and Atish P. Sinha
COMMUNICATIONS OF THE ACM March 2005/Vol. 48, No. 3
Local Metadata Tools Metadata L In the OLAP realm. It questions that managers typically pose. schema. which we survey here and provide useful guidelines for future adopters. The data mart Although the methodologies used by these companies differ in details. bottom-up. a very high-level conceptual Several strategies for schema design exist. architecture design.design. and A dimensional model is composed of a fact table deployment [4. architectures. The the solution for each of the business questions is cre. they all focus on the techniques of capturing and modeling user requirements in a meaningful way. The two most popular data modeling techniques for data warehousing are Entity-Relational and Dimensional modeling. decisionSource Systems support queries may require significant aggregation and joining. These different definitions and concepts gave rise to an array of data warehousing methodologies and technologies. Mart DW L Data RDBMS E Analysis which are linked back into the T Data Local Metadata L Source Analysis Data Systems dimension table with artificial MDB Source Data Operational Analysis Data E Central Systems Mart Analysis Data Store OLTP Central T Metadata keys . including business requirements analysis. Business requirements additive. and hardware and software infrastructure asking the users to rate the questions. translating the ER schema into a relaData warehousing methodologies share a common set tional schema. Business quesArchitecture is a blueprint that allows communications are decision support or analytic tion. planning. starting with a conceptual entity-relationship Tasks in Data Warehousing Methodology (ER) design. such as topmodel (also known as the subject-area data model) of down. the questions.data warehouse architecture design philosophies can ated. brainstorming. learning. the data warehouse. No. Hub-and-spoke Data Mart Architecture Enterprise Warehouse with Operational Data Store Distributed Data Warehouse Architecture by removing the low cardinality Data RDBMS Central Data Data Analysis E Analysis Data Analysis E attributes in the dimensions and T Warehouse E T L Local Metadata RDBMS Data Central T Data Mart L placing them in separate tables. and mixed . maintenance. or simply a star Data Mart Local Metadata schema. To improve analysis is used to elicit the business Different types of performance. data design. 3 COMMUNICATIONS OF THE ACM . they are prioritized by design.multidimensional databases needed for OLAP tasks. 48. inside-out. A dimensional model can be extended to a snowflake schema. 80 March 2005/Vol. and then normalizing the relational of tasks. Next. technical business questions are elicited. implementation. and several dimension tables . denormalization is usually promoted in questions from the intended users of data warehouse a data warehouse environment. After all the includes different areas such as data design. The data design task includes data modeling and normalization. The conceptual model serves as the blueprint for be broadly classified into enterprisewide data warethe data requirements of an organization. 9]. The Entity-Relational modeling follows the standard OLTP database design process. and JAD sessions tains attributes whose values are generally numeric and are used to elicit requirements. techniques cialized relation with a multi-attribute key and consuch as interviews. A fact table is a speFor business requirements analysis. and reuse. The characCentral Data Mart Analysis E E Warehouse T T Local Metadata teristic star-like structure of the Data Mart L L RDBMS RDBMS physical representation of a Financial Data Mart Data Source Local Metadata dimensional model is called a star Central Analysis Systems Source Metadata MDB Systems Human Resources join schema. The architecture design philosophy has its oriing the risk associated with the solutions needed for gins in the schema design strategy of OLTP databases. house design and data mart design . A dimension table has a single attribute primary key (usually surrogate) that corresponds exactly to one of the Enterprise Data Warehousing Architecture Data Mart Architecture attributes of the multi-attribute RDBMS Sales Data Data key of the fact table. or by estimat.
JAD Interview Interview document follows the mixed (top-down as Modeling Prioritization. He advocates the Kimball et al.Core Competency Intelligence and software software and OLAP Middleware server tion activities include data sourcRequirements Interview. data warehouse development should be driven data-driven requirements analysis” . Dimensional model.Table 1.” The data warehouse for the Philosophy entire organization is the union of Implementation Iterative Dimensional Life Iterative Iterative (RAD) Iterative Cycle (prototyping) those conformed data marts. . Next. uses Oracle Yes. The focus is on by data. Uses an Yes. data transform runtime Table 2. Data warehouse metaon the DBMS Allows parallelism Not reported Not reported Allows parallelism data includes definitions of con. integrated. Dimensional model.Query Design Depends to be used at the via partitioning warehouse level formed dimensions and Yes Yes Yes Yes Yes conformed facts. JAD. oriented end-user applications. Uses integrated Yes. model. The goal is to create Data Modeling ERD. heterogeneous OLTP sources. to Terabytes Terabytes including a hub-and-spoke archiHas post audit reviews. Comparison of infrastructure-based logs. uses a Yes. model. JAD. DBMS load Change Management scripts. The approach is iterative tems Development Life Cycle (SDLC). Data Modeling ERD. Uses its Yes. relational schema Dimensional model. analytic requirements that are elicited from business COMMUNICATIONS OF THE ACM March 2005/Vol. JAD. results of the programs are analyzed. 48. uses a repository Yes. and tested. uses a repository Yes. Comparison of core technology vendor-based data warehousing methodologies. Dimensional ERD. Data is first gathered. Star schema. and other types of metadata data warehousing methodologies. programs are written against the data and the For data warehouse implementation strategy. Relational schema Star and Snowflake Star schema Star schema schemas individual data marts in a botfor Develops all relations Allows both Allows both More slanted toward Allows both tom-up fashion. Because of the size of metadata. analysis templates.’s business dimensional life-cycle reverse of SDLC: instead of starting from require. The Strategy Yes. Star schema Star schema schema These activities depend on two for Not reported Not reported Not reported Allows both Allows both things—data quality manage. data cleansing Scalability Very little Very little Yes Uses Visible tools Not reported specifications. ments. Analyze data sources analysis inventory. data quality management is a very important issue. uses Microsoft figure on the preceeding page Metadata Management Repository repository Repository depicts several variants of the Query Design Parallel query Allows parallel Not reported Not reported Allows parallel development queries queries basic architectural design types. every data warehouse should be equipped with some type of metadata management tool.requirements are formulated. Uses its own Yes Management metadata management integrated own repository repository metadata platform DBMS. JAD.document analysis development of decision supportDimensional ERD.Support Normalization/ Denormalization ment and metadata management Enterprise data Enterprise Enterprise Enterprise Enterprise [5. Interview. and Modeling subject areas analysis templates. data design. Interview. Requirements Interview. As data is gathered from Architecture Design warehouse and data data warehouse data warehouse data warehouse data warehouse Philosophy marts with data marts with data marts with data marts with data marts multiple. Attributes NCR/Teradata Methodology Teradata DBMS (massively parallel DBMS) Oracle Methodology Oracle DBMS IBM DB2 Methodology DB2 DBMS Sybase Methodology Sybase DBMS Microsoft SQL Server Methodology SQL Server DBMS Core Competency design. the Inmon  advises against the use of the classical Sys. Interview. known as the waterfall approach. Finally. enterprise warehouse with Change Management but not emphasized in the methodology the methodology operational data store (real-time access support). espoused by Kimball . Data analytics Data analytics Business Business analysis Business analysis Data warehouse implementa. and data sources ing. Relational schema model. and distributed Attributes SAS Methodology Informatica’s Visible Hyperion’s Computer enterprise data warehouse architecVelocity Technologies’ STAR Associates’ Methodology Methodology Methodology Methodology ture . Prioritization. document Business process Interview. No. Star ERD. allows denormalization mance with a skeleton schema Denormalization denormalization Enterprise data Data marts Enterprise data Data marts Enterprise data known as the “data warehouse Architecture Design warehouse warehouse and warehouse and data marts data marts bus. Star schema model. which is also in nature. data staging (ETL). JAD. A data Implementation Iterative Iterative spiral Iterative Iterative Iterative (Piloting/ warehouse generates much more Strategy prototyping ) metadata than a traditional Metadata Yes. but in confor.Support Normalization/ as normalized. to hundreds of Not reported Yes Not reported Yes. 3 81 .approach “differs significantly from more traditional. Dimensional Warehouse model. Scalability Yes. document prioritization. 7]. document subject areas analysis well as bottom-up) strategy of Dimensional ERD. Not reported Not reported Has maintenance in Not reported tecture.
JAD. infrastructure vendors. architecture design. model. or to maintenance. The core-technology vendors are those data warehousing methodologies. to help extract. are more based methodology. 82 March 2005/Vol. physical design. and other factors. and other phases. Attributes SAP Methodology PeopleSoft Methodology ERP CGEY Methodology General business consulting as one of the leading causes of data warehouse failures. an iterative (spiral) ologies we review include NCR’s Teradata-based approach such as prototyping is usually adopted. Due to increased end-user enhancements. or are too difficult/expensive to change with the evolving needs of the business. These vendors the business requirements a use data warehousing schemes that take advantage of priori. tion. Surprisingly. so the SDLC (waterfall) approach is not viable. Oracle and Peoplesoft) Change management is an important issue to consider in selecting a data warehousing methodology. whose methodologies could Table 3. business processes.manage metadata using repositories. IT consulting Business Intelligence consulting Interviews. Interview templates Interview Follows varied approach (SAP. 48. No. data granularities. Predefined data Dimensional ERD/Object model. Microsoft. Scalability Yes Integrated and Yes Yes Yes scalable open Core Competency Attribute. Based on the data wareat the requirements level housing tasks described earlier. and load data into the data warehouse. Iterative (RAD) and information modeling comStrategy used by the Iterative (spiral) type chosen panies. in the data warehouse realm could be a mechanism to The deployment task focuses on solution integra. The life-cycle approach starts with project planning and is followed by business requirements definition. the nuances of their database engines. Sybase’s methodology. a data Corporate Creative Data warehouse usually goes through Information Designs Methodology Methodology several versions. dimensional modeling. Relational schema Star schema which we believe are fairly repreDimensional Star schema model. 3 COMMUNICATIONS OF THE ACM . it is they are in. The architecture first attribute we consider is the Change Different modeling Allows impact Not reported Not reported Not reported Management methods for tracking analysis core competency of the compahistory nies. maintenance is cited tools typically work with a variety of database engines. and amenable to a phased development approach such as Microsoft’s SQL Server-based methodology. and data warehouse transfer. business dimensional life cycle because they focus on The second category. on the other hand. Warehouses fail because they do not meet the needs of the business. Core Competency Requirements Modeling ERP Comparing Data Warehousing Methodologies We analyzed 15 different data warehousing methodologies. data warehouse tuning. deployment. we Metadata Integrated meta data Yes Not reported Yes Not reported present a set of attributes that capManagement repository ture the essential features of any Query Design Allows ad hoc queries Allows ad hoc Allows ad hoc Not reported Not reported queries queries data warehousing methodology. The methodTo elicit the requirements. Although solution integration and data help create end-user solutions. repeated schema changes. Normalization/ denormalization strategies Denormalization The sources of those methodoloArchitecture Enterprise data Enterprise data Data marts Enterprise data Enterprise data gies can be classified into three Design warehouse and data warehouse and data warehouse and data warehouse and data Philosophy marts marts marts marts broad categories: core-technology vendors. An infrastructure tool prisewide warehouse. Data Modeling Dimensional model.managers/executives to design dimensional data marts. impractical to determine all companies that sell database engines. very few vendors incorporate change management in their methodologies. Dimensional Model. Comparison of For enterprisewide data have different emphases depending upon the segment information modeling-based warehouse development. Oracle’s methodology. infrastructure vendors.house infrastructure business. Implementation Iterative (prototyping) SDLC Follows steps SDLC (waterfall). which are much smaller in scope includes those companies that are in the data wareand complexity than the requirements for an enter. IBM’s DB2Individual data marts. Star schema sentative of the range of available Support for Allows Not reported Follows multiple Not reported Not reported methodologies (see Tables 1–3). document analysis Subject areas. methodology. The infrastructure warehouse tuning are essential. etc. Extended star schema warehouse model.
This attribute focuses on data modeling techniques that the methodologies use to develop logical and physical models. Implementation Strategy Attribute. Informatica’s methodology. Large data warehouse tables take a long time to process. The third category. and Hyperion. Other vendors listed in tables 2 and 3 COMMUNICATIONS OF THE ACM March 2005/Vol. This is a predominant feature in NCR’s Teradata DBMS and is therefore included in its methodology. an information model (also called a warehouse model) is created based on those requirements. star schema. Vendors like Microsoft and Oracle allow parallel queries. a conceptual ERD is first translated into a dimensional model. No. are DBMS-independent. SAS. Such methodologies include SAS’s methodology. To support OLAP queries. ranging from enterprisewide data warehouse design to data mart design. Sybase. The normalization/denormalization process is an important part of a data warehousing methodology. providing strong support for parallel query processing. therefore. put a lot of emphasis on capturing business requirements and developing information models based on those requirements. Data warehousing is a technology service for most consulting companies. Query Design Attribute.The methodologies proposed in this category. especially if they must be joined with others. Computer Associates. Depending on the methodology. Oracle and Informatica’s creation of subject areas. The model is logically represented in the form of an ERD. Some DBMS vendors (Oracle. Metadata Management Attribute. and two IT/data-warehouse consulting companies (Corporate Information Designs and Creative Data). most vendors have adopted the iterative prototyping approach. 3 83 . several methodologies use streamlining tricks. Therefore. a dimensional model. including IBM. Visible Technologies’ methodology. the implementation strategy could vary between an SDLC-type approach and a RADtype approach. and Visible Technology) have an edge because they have their own repository systems to manage metadata. As this elicitation process is fairly unstructured. ranging from standard systems development life-cycle techniques such as interviews and observations to JAD sessions. Architecture Design Philosophy Attribute. relational databases require frequent table joins. information modeling vendors. Teradata. The organization needs to determine which approach will be the most suitable before adopting a methodology. To improve query performance. Although the methodologies used by these companies differ in details. A number of strategies are available for designing a data warehouse architecture. a very important aspect of data warehousing. Various types of requirements elicitation strategies are used in practice. and Hyperion’s methodology. In the Sybase methodology. Other vendors. IBM. includes ERP vendors (SAP and PeopleSoft). therefore. SAP. and Microsoft) and some infrastructure vendors (Informatica. and NCR/Teradata and Sybase’s templatedirected elicitation. Some DBMS vendors allow parallel query generation and execution. Data Modeling Attribute. possibly due to the fact that they depend on the DBMS to be used. understanding and representing user requirements accurately is very important. Data warehouse methodologies. Because query performance is an important issue. or snowflake schema during physical design. Oracle. We found that all DBMS vendors explicitly support the denormalization activity. including general ones like Cap Gemini Ernst Young (CGEY) or specific ones like Corporate Information Designs and Creative Data. NCR/Teradata. the core competency of this category is information modeling of the clients’ needs. Requirements Modeling Attribute. some vendors place a lot of emphasis on how queries are designed and processed. We include ERP vendors because data warehousing can leverage the investment made in ERP systems. Support for Normalization/Denormalization Attribute. 48. Other vendors listed in Tables 2 and 3 do not report this capability much. We group the ERP and consulting companies into one category because of the similarities in their objectives. For building a data warehouse. Computer Associates’ Platinum methodology. Examples include NCR/Teradata’s elicitation and prioritization of business questions. a methodology must support denormalization. a general business consulting company (Cap Gemini Ernst Young). Teradata is a truly parallel DBMS. but process them in a conventional fashion. Almost all vendors focus on metadata management. The logical model is then translated into a relational schema. Within the RAD category. use the dimensional model for logical design and the star schema for physical design. or some other type of conceptual model (such as an object model). and Informatica provide examples of methodologies that map an ERD into a set of normalized relations. they all focus on the techniques of capturing and modeling user requirements in a meaningful way. Once the requirements are captured. which can be very costly. This attribute focuses on techniques of capturing business requirements and modeling them.
As the industry matures. In Teradata. 4. and evaluated. It is apparent that the core vendor-based methodologies are appropriate for those organizations that understand their business issues clearly and can create information models. Kimball.inmoncif. very few vendors incorporate change 84 March 2005/Vol. note that scalability is highly dependent on the type of DBMS being used. acquisition is a normal strategy for growth. Surprisingly. 9. www. With advances in portal technology. Understanding and Implementing Successful Data Marts. The dilemma of change: Managing changes over time in the data warehouse/DSS environment..edu) is a full professor and Mays Fellow in the Department of Information and Operations Management in Mays Business School at Texas A&M University. If the focus is on the infrastructure of the data warehouse such as metadata or cube design. changes in object definitions. there could be a convergence of the methodologies. S. W. MA. 1992. Company divestiture is also another source of changes for any enterprise. Change management is an important issue to consider in selecting a data warehousing methodology. New York. The Visible Technologies methodology strongly focuses on change management and has tools to support this process. As the data warehouse implementation effort progresses. None of the methodologies reviewed in this article has achieved the status of a widely recognized standard as yet. and deletion of obsolete front-end objects. expansion of bandwidth. or adding new data sources. c References 1. W. Atish P. but has a less severe impact on a data warehouse. to republish. recorded. there could be various changes to the front-end interface. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling. Oracle. M. or SQL Server. www. specialized technical support. MA. For example. 2002. New York. similar to what happened with database design methodologies. With OLAP front-end tools. additional user requests and enhancements will inevitably arise. Inmon. Colorado. Ross. Although all methodologies support scalability.. White Paper. Wiley. D. 2002. Reeves. M.pdf. W. and Meers. to post on servers or to redistribute to lists. 3rd edition. Addison-Wesley. it is advisable to use the infrastructure-based methodologies. 48.. Ceri.inmoncif. C. Products get assigned to new categories. 3. www. The Data Warehouse Lifecycle Toolkit. and efforts to standardize models.H. © 2005 ACM 0001-0782/05/0300 $5. for example. L. For a large number of enterprises in today’s economy. Organizations should consider this issue before selecting a data warehousing methodology. Metadata in the data warehouse.. 1997. White Paper. 7.tamu. Benjamin/Cummings. An intelligent enterprise should be able to manage and evaluate its business processes. D. Sybase. However. Sinha (sinha@uwm. Wiley. Teradata does not economically scale down below a terabyte. Changes in the physical world also affect the data warehouse. Inmon. Newer technologies could also affect the way an e-commerce site is set up and introduce changes.kalido. An acquisition or a merger could have a major impact. Various changes affect the data warehouse . Wiley. 5. An example of a process change is introducing new data cleansing routines. Change Management Attribute. the cost of the proprietary hardware. W.dciexpo. Redwood City. scalability can be achieved by adding more disk space. such as addition of new front-end objects initially not available. Inmon. 2000. Conclusion Data warehousing methodologies are rapidly evolving but vary widely because the field of data warehousing is not very mature. To copy otherwise. and other related activities. Arun Sen (Asen@cgsb.depend on the DBMS they use.. which would necessitate managing additional load scripts. load map priorities. and specialized data loading utilities in Teradata result in higher overhead and development costs than DB2. 2001.K. R. 8. W. and Navathe. Those changes need to be handled. Inmon. firms could be reconfiguring their Web sites. 6. the organizations should adopt the informationmodeling based methodologies. No.00 . Hackney. R. while in others. For a data warehouse project. Building the Data Warehouse. Scalability Attribute. and Thronthwaite. Kimball. Sales regions get reconfigured. CA. White Paper. 1999. and backup scripts. www. thereby initiating a lot of changes. Andover.com/library/whiteprs/earlywp/ttmeta. Sometimes it is important to capture those changes in the warehouse for future analyses. Tech Topic 10. customers frequently change their addresses. 1998.P. requires prior specific permission and/or a fee. 2nd edition.com/library/whiteprs/techtopic/tt10. increasing the size may require considerable effort. it could imply rescoping of warehouse development. DCI. and Ross. Batini. 3 COMMUNICATIONS OF THE ACM management in their methodologies. 1997. 2. redefining business objectives. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. S. Otherwise. Reading.edu) is an associate professor in the School of Business Administration at the University of WisconsinMilwaukee.com. Conceptual Database Design: An Enity-Relationship Approach. it is usually masked as maintenance. replanning priorities.com. Changes in process are part of the natural evolution of any enterprise. Metadata in the data warehouse: A statement of vision.pdf. Pine Cone Systems. When they do. New York. DCI Seminar Workbook—Strategies and Tools for Successful Data Warehouses.
This action might not be possible to undo. Are you sure you want to continue?
We've moved you to where you read on your other device.
Get the full title to continue listening from where you left off, or restart the preview.