Professional Documents
Culture Documents
6/17/11 7:06 AM
http://www.tdan.com/view-articles/4770
Page 1 of 4
6/17/11 7:06 AM
needs. For example, a data warehouse can be used to create rich data sets for statisticians, deliver operational reports, or support operational data stores (ODS) and analytic applications. Moreover, users can query the data warehouse if they need cross-functional or enterprise views of the data. On the downside, a top-down approach may take longer and cost more to deploy than other approaches, especially in the initial increments. This is because organizations must create a reasonably detailed enterprise data model as well as the physical infrastructure to house the staging area, data warehouse, and the marts before deploying their applications or reports. (Of course, depending on the size of an implementation, organizations can deploy all three tiers within a single database.) This initial delay may cause some groups with their own IT budgets to build their own analytic applications. Also, it may not be intuitive or seamless for end users to drill through from a data mart to a data warehouse to find the details behind the summary data in their reports. Bottom-Up Approach In a bottom-up approach, the goal is to deliver business value by deploying dimensional data marts as quickly as possible. Unlike the top-down approach, these data marts contain all the databoth atomic and summarythat users may want or need, now or in the future. Data is modeled in a star schema design to optimize usability and query performance. Each data mart builds on the next, reusing dimensions and facts so users can query across data marts, if desired, to obtain a single version of the truth as well as both summary and atomic data. The bottom-up approach consciously tries to minimize back-office operations, preferring to focus an organizations effort on developing dimensional designs that meet end-user requirements. The bottom-up staging area is non-persistent, and may simply stream flat files from source systems to data marts using the file transfer protocol. In most cases, dimensional data marts are logically stored within a single database. This approach minimizes data redundancy and makes it easier to extend existing dimensional models to accommodate new subject areas. Pros/Cons. The major benefit of a bottom-up approach is that it focuses on creating user-friendly, flexible data structures using dimensional, star schema models. It also delivers value rapidly because it doesnt lay down a heavy infrastructure up front. Without an integration infrastructure, the bottom-up approach relies on a dimensional bus to ensure that data marts are logically integrated and stovepipe applications are avoided. To integrate data marts logically, organizations use conformed dimensions and facts when building new data marts. Thus, each new data mart is integrated with others within a logical enterprise dimensional model. Another advantage of the bottom-up approach is that since the data marts contain both summary and atomic data, users do not have to drill through from a data mart to another structure to obtain detailed or transaction data. The use of a staging area also eliminates redundant extracts and overhead required to move source data into the dimensional data marts. One problem with a bottom-up approach is that it requires organizations to enforce the use of standard dimensions and facts to ensure integration and deliver a single version of the truth. When data marts are logically arrayed within a single physical database, this integration is easily done. But in a distributed, decentralized organization, it may be too much to ask departments and business units to adhere and reuse references and rules for calculating facts. There can be a tendency for organizations to create independent or non-integrated data marts. In addition, dimensional marts are designed to optimize queries, not support batch or transaction processing. Thus, organizations that use a bottom-up approach need to create additional data structures outside of the bottom-up architecture to accommodate data mining, ODSs, and operational reporting requirements. However, this may be achieved simply by pulling a subset of data from a data mart at night when users are not active on the system. Hybrid Approach The hybrid approach tries to blend the best of both top-down and bottom-up approaches. It attempts to capitalize on the speed and user-orientation of the bottom-up approach without sacrificing the integration enforced by a data warehouse in a top down approach. Pieter Mimno, an independent consultant who teaches at TDWI conferences, is currently the most vocal proponent of this approach. The hybrid approach recommends spending about two weeks developing an enterprise model in third normal form before developing the first data mart. The first several data marts are also designed in third normal form but deployed using star schema physical models. This dual modeling approach fleshes out the enterprise model without sacrificing the usability and query performance of a star schema. The hybrid approach relies on an extraction, transformation, and load (ETL) tool to store and manage the enterprise and local models in the data marts as well as synchronize the differences between them. This lets local groups, for example, develop their own definitions or rules for data elements that are derived from the enterprise model without sacrificing long-term integration. Organizations also use the ETL tool to extract and load data from source systems into the dimensional data marts at both the atomic and summary levels. Most ETL tools today can create summary tables on the fly. After deploying the first few dependent data marts, an organization then backfills a data warehouse behind the data marts, instantiating the fleshed out version of the enterprise data model. The organization then transfers atomic data from the data marts to the data warehouse and consolidates redundant data feeds, saving the organization time, money, and processing resources. Organizations typically backfill a data warehouse once business users request views of atomic data across multiple data marts. The major benefit of a hybrid approach is that it combines rapid development techniques within an enterprise architecture framework. It develops an enterprise data model iteratively and only develops a heavyweight infrastructure once its really needed (e.g. when executives start asking for reports that cross data mart boundaries.) However, backfilling a data warehouse can be a highly disruptive process that delivers no ostensible value and therefore may never be funded. In addition, few query tools can dynamically and intelligently query atomic data in one database (i.e. the data warehouse) and summary data in another database (i.e. the data marts.) Users may be confused when to query which database. This approach also relies heavily on an ETL tool to synchronize meta data between enterprise and local versions, develop aggregates, load detail data, and orchestrate the transition to a data warehousing infrastructure. Although ETL tools have matured considerably, they can never enforce adherence to architecture. The hybrid approach may make it too easy for local groups to stray irrevocably from the enterprise data model. Federated Approach The federated approach is sometimes confused with the hybrid approach above or hub-and-spoke data warehousing architectures that are a reflection of a top-down approach. However, the federated approachas defined by its most vocal proponent, Doug Hackneyis not a methodology or
http://www.tdan.com/view-articles/4770
Page 2 of 4
6/17/11 7:06 AM
architecture per se, but a concession to the natural forces that undermine the best laid plans for deploying a perfect system. A federated approach rationalizes the use of whatever means possible to integrate analytical resources to meet changing needs or business conditions. In short, its a salve for the soul of the stressed out data warehousing project manager who must sacrifice architectural purity to meet the immediate (and ever-changing) needs of his business users. Hackney says the federated approach is an architecture of architectures. It recommends how to integrate a multiplicity of heterogeneous data warehouses, data marts, and packaged applications that companies have already deployed and will continue to implement in spite of the IT groups best effort to enforce standards and adhere to a specific architecture. Hackney concedes that a federated architecture will never win awards for elegance or be drawn up on clean white boards as an optimal solution. He says it provides the maximum amount of architecture possible in a given political and implementation reality. The approach merely encourages organizations to share the highest value metrics, dimensions, and measures wherever possible, however possible. This may mean, for example, creating a common staging area to eliminate redundant data feeds or building a data warehouse that sources data from multiple data marts, data warehouses, or analytic applications. The major problem with the federated approach is that it is not well documented. There are only a few columns written on the subject. But perhaps this is enough, as it doesnt prescribe a specific end-state or approach. Another potential problem is that without a specific architecture in mind, a federated approach can perpetuate the continued decentralization and fragmentation of analytical resources, making it harder to deliver an enterprise view in the end. Also, integrating meta data is a pernicious problem in a heterogeneous, ever-changing environment. Summary The four approaches described here represent the dominant strains of data warehousing methodologies. Data warehousing managers need to be aware of these methodologies but not wedded to them. These methodologies have shaped the debate about data warehousing best practices, and comprise the building blocks for methodologies developed by practicing consultants. Ultimately, organizations need to understand the strengths and limitations of each methodology and then pursue their own way through the data warehousing thicket. Since each organization must respond to unique needs and business conditions, having a foundation of best practice models to start with augurs a successful outcome. Top-Down Bottom-Up Hybrid Federated
Major Characteristics
Emphasizes the DW. Starts by designing an enterprise model for a DW. Deploys multi- tier architecture comprised of a staging area, a DW, and dependent data marts. The staging area is persistent. The DW is enterprise- oriented; data marts are function- specific. The DW has atomic- level data; data marts have summary data. The DW uses an enterprisebased normalized model; data marts use a subject- specific dimensional model. Users can query the data warehouse and data marts. Emphasizes data marts. Starts by designing a dimensional model for a data mart. Uses a flat architecture consisting of a staging area and data marts. The staging area is largely non- persistent. Data marts contain both atomic and summary data. Data marts can provide both enterprise and function- specific views. A data mart consists of a single star schema, logically or physically deployed. Data marts are deployed incrementally and integrated using conformed dimensions. Emphasizes DW and data marts; blends top- down and bottom- up methods. Starts by designing enterprise and local models synchronously. Spends 23 weeks creating a high- level, normalized, enterprise model; fleshes out model with initial marts. Populates marts with atomic and summary data via a non- persistent staging area. Models marts as one or more star schemas. Uses ETL tool to populate data marts and exchange meta data between ETL tool and data marts. Backfills a DW behind the marts when users want views at atomic level across marts; instantiates the fleshed out enterprise model, and moves atomic data to the DW. Emphasizes the need to integrate new and existing heterogeneous BI environments. An architecture of architectures. Acknowledges the reality of change in organizations and systems that make it difficult to implement a formalized architecture. Rationalizes the use of whatever means possible to implement or integrate analytical resources to meet changing needs or business conditions. Encourages organizations to share dimensions, facts, rules, definitions, and data wherever possible, however possible.
Pros
Enforces a flexible, enterprise architecture. Once built, minimizes the possibility of renegade independent data marts. Supports other analytical structures in an architected environment, including data mining sets, ODSs, and operational reports. Keeps detailed data in normalized form so it can be flexibly re- purposed to meet new and unexpected needs. Data warehouse eliminates redundant extracts. Focuses on creating userfriendly, flexible data structures. Minimizes back office operations and redundant data structures to accelerate deployment and reduce cost. No drill- through required since atomic data is always stored in the data marts. Creates new views by extending existing stars or building new ones within the same logical model. Staging area eliminates redundant extracts. Provides rapid development within an enterprise architecture framework. Avoids creation of renegade independent data marts. Instantiates enterprise model and architecture only when needed and once data marts deliver real value. Synchronizes meta data and database models between enterprise and local definitions. Backfilled DW eliminates redundant extracts. Provides a rationale for band aid approaches that solve real business problems. Alleviates the guilt and stress data warehousing managers might experience by not adhering to formalized architectures. Provides pragmatic way to share data and resources.
Cons
Upfront modeling and platform deployment mean the first increments take longer to deploy and cost more. Few query tools can easily join data across multiple, physically distinct marts. Requires groups throughout an Requires organizations to enforce standard use of entities and rules. Backfilling a DW is disruptive, The approach is not fully articulated. With no predefined end- state or architecture in mind, it may
http://www.tdan.com/view-articles/4770
Page 3 of 4
6/17/11 7:06 AM
Requires building and managing multiple data stores and platforms. Difficult to drill through from summary data in marts to detail data in DW. Might need to store detail data in data marts anyway.
organization to consistently use dimensions and facts to ensure a consolidated view. Not designed to support operational data stores or operational reporting data structures or processes.
requiring corporate commitment, funding, and application rewrites. Few query tools can dynamically query atomic and summary data in different databases.
give way to unfettered chaos. It might encourage rather than reign in independent development and perpetuate the disintegration of standards and controls.
Major Proponents
Bill Inmon and co- authors Ralph Kimball and co- authors Many practitioners Doug Hackney
Go to Current Issue | Go to Issue Archive Recent articles by Wayne Eckerson Are You Stuck In BI Adolescence? The Business Intelligence Evangelist Wayne Eckerson - Wayne Eckerson has been a thought leader and consultant in the business intelligence (BI) field since 1995. He has conducted numerous in-depth research studies and is a noted speaker and blogger. He is the author of the best-selling book Performance Dashboards: Measuring, Monitoring, and Managing Your Business . For many years, he served as director of education and research at The Data Warehousing Institute (TDWI) where he chaired its BI Executive Summit and created a popular BI Maturity Model and Assessment. Wayne is currently director of research at TechTarget and president of BI Leader Consulting, which provides advisory services to user and vendor organizations. He can be reached at weckerson@techtarget.com.
http://www.tdan.com/view-articles/4770
Page 4 of 4