
Research

Publication Date: 31 August 2005 ID Number: G00130021

The Third Normal Form Is the Base of Your Data Warehouse


Mark A. Beyer

To successfully deploy a future-extensible enterprise data warehouse, data warehouse modelers, data architects and database administrators must understand the importance of a Third Normal Form data layer that supports the data warehouse.

© 2005 Gartner, Inc. and/or its Affiliates. All Rights Reserved. Reproduction of this publication in any form without prior written permission is forbidden. The information contained herein has been obtained from sources believed to be reliable. Gartner disclaims all warranties as to the accuracy, completeness or adequacy of such information. Although Gartner's research may discuss legal issues related to the information technology business, Gartner does not provide legal advice or services and its research should not be construed or used as such. Gartner shall have no liability for errors, omissions or inadequacies in the information contained herein or for interpretations thereof. The opinions expressed herein are subject to change without notice.

WHAT YOU NEED TO KNOW


Data warehouse project teams need to recognize that a Third Normal Form (3NF) model is an architectural component that enables the data warehouse. The 3NF base enables or encourages rapid integration for operational analytics, future extensibility, single-point maintenance and end-user tool flexibility.

ANALYSIS
One of the primary issues facing designers early in data warehouse planning involves choosing between Third Normal Form (3NF) and denormalized star schemas. A 3NF data warehouse schema is a set of data entities based on the principles of data normalization to the third form (see Table 1). Most data systems never achieve full 3NF; however, they "approach" it. The definition of a 3NF data warehouse lies in the compromises made between Second Normal Form (2NF) and 3NF deployments (see Note 1). The 3NF approach generally does a better job of meeting the total data warehouse service-level agreement (SLA), instead of concentrating only on performance (see Note 2).

Table 1. Forms of Data Normalization

Form | Definition
First Rule of Normal Form | Remove redundant data from horizontal rows. All data should be held in columns and rows.
Second Rule of Normal Form | Remove redundant data in vertical columns. Values uniquely identify each row in each table.
Third Rule of Normal Form | Remove data values independent of primary row keys. Each table contains unique data.

Source: Wikipedia.org/wiki/database_normalization (August 2005)
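As a minimal sketch of the third rule in Table 1, the following Python fragment splits a flat set of rows so that values that do not depend on the row key move to their own table. The row layout, the field names and the function `to_third_normal_form` are hypothetical and chosen only for illustration.

```python
# Hypothetical flat rows: customer_city depends on customer_id, not on order_id,
# so it is independent of the primary row key and is a candidate for removal.
flat_orders = [
    {"order_id": 1, "customer_id": "C1", "customer_city": "Stamford", "amount": 120.0},
    {"order_id": 2, "customer_id": "C1", "customer_city": "Stamford", "amount": 75.5},
    {"order_id": 3, "customer_id": "C2", "customer_city": "Egham", "amount": 40.0},
]


def to_third_normal_form(rows):
    """Split flat rows into an orders list and a customers lookup."""
    customers = {}
    orders = []
    for row in rows:
        # Each customer attribute is stored once, keyed by customer_id.
        customers[row["customer_id"]] = {"city": row["customer_city"]}
        # The order row keeps only values that depend on its own key.
        orders.append(
            {
                "order_id": row["order_id"],
                "customer_id": row["customer_id"],
                "amount": row["amount"],
            }
        )
    return orders, customers


orders, customers = to_third_normal_form(flat_orders)
```

After the split, the city for customer C1 is held in one place, so a correction to it is a single-point change rather than an update to every order row.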

Deciding whether to use star schemas or 3NF often becomes complicated, because star schema deployments also fall between 2NF and 3NF design. When adapting an online transaction processing database for a data warehouse, star schema approaches deliver better performance by design than the 3NF approach.

The primary difference between a star deployment and a 3NF deployment is that stars rely on summary and aggregate analysis outcomes based on expected requirements. The star approach assumes that most of the questions are known at design time. A 3NF data warehouse relies on the definition of the components that make up the expected analysis. It maintains the components as separate data entities to ease the evaluation of as many analytic outcomes as possible. The 3NF approach assumes that the inputs to any question should be captured, because the questions to be asked of the data are unknown at design time.

Unfortunately, many devotees of either approach avoid the fact that stars often have 3NF staging areas and that 3NF models often require some type of optimization, frequently a star.

Benefits of 3NF

The benefits of using 3NF include high flexibility.

Data Detail Flexibility


Dimensional Flexibility. 3NF design assumes that dimensions and the hierarchies within them will change in an unpredictable pattern. It supports data as a dimension in one analysis model and as a fact in a separate analysis. Most data models remain valid and extensible within an industry paradigm. At some point, even the best, highly generic 3NF entity will be interpreted as a restrictive role-based entity (for example, "store" instead of "retail outlet").

Abstract Entity Definition. Abstract entities correctly use established tables for various concepts. This approach puts a lot of responsibility on the data steward and governance bodies, while easing database maintenance. The first iteration of the 3NF warehouse has more-abstract table names, as opposed to role-based names (for example, "party" instead of "customer"), and it overcomes the objection that 3NF models result in table proliferation. For example, an entity named "building" could be deployed instead as two tables, "structure" and "structure type", making it possible to place the record for a multistory office building in the same table as a park fountain.

Impact on Real-Time Analytics. As operational systems seek input for real-time or operational analysis, they will require detailed data. Aggregates will work in this respect. However, summaries break down because the data values in summaries vary widely from online transaction processing detail rows. The 3NF data warehouse approach maintains data at a grain that matches or closely resembles operations transaction data.
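The "structure" and "structure type" example can be sketched as follows, using an in-memory SQLite database from the Python standard library. The column names, sample rows and helper functions are assumptions made for illustration; the research note names only the two tables.

```python
import sqlite3


def build_structure_model():
    """Create the two abstract tables and load two very different structures."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE structure_type (
            structure_type_id INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE structure (
            structure_id INTEGER PRIMARY KEY,
            structure_type_id INTEGER NOT NULL
                REFERENCES structure_type(structure_type_id),
            name TEXT NOT NULL
        );
        """
    )
    # Hypothetical sample data: the type table carries the role,
    # so the structure table itself stays generic.
    conn.executemany(
        "INSERT INTO structure_type (structure_type_id, name) VALUES (?, ?)",
        [(1, "office building"), (2, "park fountain")],
    )
    conn.executemany(
        "INSERT INTO structure (structure_type_id, name) VALUES (?, ?)",
        [(1, "Headquarters Tower"), (2, "Central Plaza Fountain")],
    )
    conn.commit()
    return conn


def structures_with_type(conn):
    """Return (structure name, type name) pairs in insertion order."""
    return conn.execute(
        """
        SELECT s.name, t.name
        FROM structure AS s
        JOIN structure_type AS t
          ON t.structure_type_id = s.structure_type_id
        ORDER BY s.structure_id
        """
    ).fetchall()
```

Adding a new kind of structure in this sketch is a new row in `structure_type`, not a new table, which is the sense in which the abstract approach limits table proliferation.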

Data Mart Management

A dependent data mart strategy creates a single maintenance point for all of the analytics data in an organization. However, the star approach greatly enables independent marts. Conversely, the 3NF approach provides strong discouragement for independent data marts.

• A 3NF warehouse uses shared dimensions (see "Use the Star Schema as an Optimization Layer in Your Data Warehouse Model"), which are single instances of tables that are used in multiple query analyses and across multiple subject areas. Because all data in the 3NF model is relational, once a table is established as the repository for a specifically defined row of data, there is no need to replicate the data, other than for optimization to dependent marts. Star schema approaches can also support single maintenance points when using dependent data marts.

• Extraction, transformation and loading (ETL) strategies that load a 3NF model represent a single point of requirements definition, data maintenance and program administration. All processing to the 3NF model is concerned first with moving the data in its cleanest form to a table that closely resembles the grain and column detail of a single corporate, logical model, and is usually significantly different from the source model. Once that data processing stream is defined and deployed, there is no need to move the data from the source and introduce the potential for disparate row selection, value processing or data warehouse load strategies. Again, a star schema approach for dependent data marts can also support this position.

Promoting a Unified Data Version

Unsynchronized versions of corporate data must be avoided. Both star and 3NF approaches are threatened by the issue of multiple data versions. However, unlike the star approach, 3NF approaches discourage the ETL architect from taking ownership of data transformations near the presentation level.

• A 3NF approach permits tools to access the model directly, or to deploy views and even tables that mimic stars if the front-end tool requires them. Thus, even the 3NF warehouse is optimized, where appropriate, by the star schema. A star approach cannot restate data into a 3NF model for the benefit of the front-end tool. The 3NF model is independent of front-end tool modeling requirements (see "Data Warehouses Need to Be Designed for More Than BI Tools").

• All dependent data marts can connect to the same 3NF model and aggregate or summarize data as desired for each constituent user group. The 3NF warehouse will have the data at the level of detail needed for any given analysis, because it exists at the lowest grain available from the sources. The star schema can only provide data at the predetermined level.

• Data quality processing to load the 3NF warehouse provides significant insight into the single set of cleansing rules that produced the data and, at minimum, provides a common starting point for parallel analysis. The 3NF warehouse has a single data ETL architecture, a single data quality strategy, a single logical data model, and a shared aggregation and summarization strategy.

• The 3NF data warehouse can deploy optimization layers that include summarization when the summary approaches exhibit a common use throughout the organization. For example, when accounting needs summary data at the same time customer service needs it, the summary rows (which are much less normalized than 3NF) can be deployed in the 3NF data warehouse. As an added advantage, the dimensional data that is attached to such summaries becomes a shared dimension by default.
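The idea that dependent marts summarize from one detailed 3NF model, and that a view can mimic a star-style summary for a front-end tool, can be sketched with an in-memory SQLite database. All table names, column names, sample rows and the view `sales_by_product_month` are hypothetical; this is an illustration under those assumptions, not a reference design.

```python
import sqlite3


def build_warehouse():
    """Create a small detail-grain model plus one presentation view."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE party (party_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE sale (
            sale_id INTEGER PRIMARY KEY,
            party_id INTEGER NOT NULL REFERENCES party(party_id),
            product_id INTEGER NOT NULL REFERENCES product(product_id),
            sale_date TEXT NOT NULL,
            amount REAL NOT NULL
        );
        -- Star-like summary exposed as a view; the detail rows stay the
        -- single maintained copy of the data.
        CREATE VIEW sales_by_product_month AS
        SELECT p.name AS product_name,
               substr(s.sale_date, 1, 7) AS sale_month,
               SUM(s.amount) AS total_amount
        FROM sale AS s
        JOIN product AS p ON p.product_id = s.product_id
        GROUP BY p.name, substr(s.sale_date, 1, 7);
        """
    )
    conn.executemany("INSERT INTO party VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])
    conn.executemany("INSERT INTO product VALUES (?, ?)", [(1, "Widget"), (2, "Gadget")])
    conn.executemany(
        "INSERT INTO sale VALUES (?, ?, ?, ?, ?)",
        [
            (1, 1, 1, "2005-07-03", 100.0),
            (2, 2, 1, "2005-07-19", 50.0),
            (3, 1, 2, "2005-08-01", 30.0),
            (4, 2, 1, "2005-08-15", 25.0),
        ],
    )
    conn.commit()
    return conn


def product_month_totals(conn):
    """One user group reads the predefined star-like view."""
    return conn.execute(
        "SELECT product_name, sale_month, total_amount "
        "FROM sales_by_product_month ORDER BY product_name, sale_month"
    ).fetchall()


def party_totals(conn):
    """Another user group summarizes the same detail rows by party."""
    return conn.execute(
        "SELECT pa.name, SUM(s.amount) FROM sale AS s "
        "JOIN party AS pa ON pa.party_id = s.party_id "
        "GROUP BY pa.name ORDER BY pa.name"
    ).fetchall()
```

Both summaries are derived from the same `sale` rows, so a question that was not anticipated when the view was designed (here, totals by party) can still be answered without reloading data from the source.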

The star and 3NF approaches support the understanding of data beyond the data acquisition context. The caution here is that 3NF does not support this process any better than a star approach; it merely highlights the issue more forcefully.

Risks and Mitigation of 3NF

Time-to-Value

Deploying a 3NF schema challenges the time-to-value cycle.

• In general, 3NF deployments require multistaged ETL processes that can be as simple or as complex as the system architect desires. In a 3NF model, the data is loaded at the detail level, and then summary or aggregate tables are created as needed, purely for performance optimization reasons (usually as stars). The wise 3NF deployment will include some initial summary and aggregate design, at least in the database view layer, to ensure that end-user acceptance testing can proceed with good performance.

• Front-end business intelligence (BI) tools that access 3NF data usually include an architect's layer to reconfigure the data for presentation to the user. This layer has to be designed, tested and deployed almost independently of the database design, and then connected to the database once the final 3NF design is completed. This effort can be decoupled from, and run in parallel with, database design and population.
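The multistaged load described above (cleanse, load at detail grain, then derive summaries purely for performance) might look like the following minimal Python sketch. The stage functions, the dictionary standing in for the warehouse and the field names are all assumptions for illustration; a real deployment would use an ETL tool or database load utilities.

```python
from collections import defaultdict


def cleanse(source_rows):
    """Stage 1: standardize values and drop rows missing required fields."""
    cleaned = []
    for row in source_rows:
        if row.get("amount") is None or not row.get("store"):
            continue
        cleaned.append(
            {
                "store": row["store"].strip().upper(),
                "day": row["day"],
                "amount": float(row["amount"]),
            }
        )
    return cleaned


def load_detail(warehouse, rows):
    """Stage 2: append rows to the detail table at the source grain."""
    warehouse.setdefault("sale_detail", []).extend(rows)


def build_summary(warehouse):
    """Stage 3: derive a summary from the detail rows, for performance only."""
    totals = defaultdict(float)
    for row in warehouse["sale_detail"]:
        totals[row["store"]] += row["amount"]
    warehouse["sale_summary_by_store"] = dict(totals)


# Hypothetical source extract with inconsistent formatting and one bad row.
source = [
    {"store": " north ", "day": "2005-08-01", "amount": "10.5"},
    {"store": "North", "day": "2005-08-02", "amount": 4.5},
    {"store": "south", "day": "2005-08-01", "amount": None},
    {"store": "South", "day": "2005-08-02", "amount": "20"},
]

warehouse = {}
load_detail(warehouse, cleanse(source))
build_summary(warehouse)
```

Because the summary is rebuilt from `sale_detail`, it can be dropped or redefined without touching the load from the source, which is the sense in which summaries are an optimization layer rather than the system of record.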

End-User Comprehension

3NF schema designs are not easily comprehended by end users, relative to star schema designs. Thus, validation of requirements becomes more complex.

• Users do not think of data as the assembly of many parts. They think of data as information with context. A 3NF deployment identifies data entities that can exist in multiple contexts and, thus, the definitions attempt to remove commonly understood context. Frequently, the most advanced 3NF designers will use stars to explain the data available in the warehouse and never reveal the underlying 3NF structure.

• A good 3NF design will force users to rethink their own concepts of business processes and their comprehension of source system designs. This often leads to long education sessions. For example, telling a logistics and distribution management team that warehouses and trucks do not exist, but that structures with function, equipment and staff do, can cause significant confusion.

Complexity

3NF schema development and implementation are accomplished through fairly complex principles, providing little or no cost relief when establishing specialized development and support teams. Because 3NF data warehouse models differ significantly from transactional 3NF models, there is an education curve for database administrators (DBAs) as they try to take the same logical model used in a transactional system and deploy it as an analytics data model.

• There is no specific focal point for data in a 3NF schema. Data is distributed throughout the model, which complicates join design and processing, and usually demands at least a view layer for power users. Unfortunately, this same distribution is one of the features that provides for future flexibility and extensibility of the warehouse. To mitigate this issue, database architects can deploy views, materialized views, database-managed summary tables and physical stars above the 3NF model.

• Data movement and load is accomplished through a more complex and multistaged data extraction process from the source. Because data from the source is often subdivided or combined in ways not anticipated by the source system experts, the cost is borne by the data warehouse budget and the project schedule.

• The 3NF design demands that the corporate organizational structure and its assets be brought to some level of 3NF. This is an abstraction process that can frustrate subject matter experts and result in their early departure from the project. Companies that have experienced few mergers or acquisitions have the greatest difficulty with this aspect of the process.

Lack of Experience Base

The concept of analytics to discover external factors or a wider context of process issues is often foreign to designers who lack data-warehousing experience. DBAs who maintain systems do so for purchased and in-house-developed systems. Design DBAs usually develop systems to specifically support linear business process management.

• Most systems that claim to be 3NF are not. Many experienced professionals incorrectly assume that familiar entity types are valid under all 3NF deployments. Many of the entities that are assumed to be universal by the experienced development and management staff can be challenged by the third rule of normalization.

• When designing for flexibility, understand that each compromise between 2NF and 3NF must be evaluated in relation to vertical industry practices and anticipated horizontal business expansion. Experience across more than a single industry and a deep background in data issues relative to mergers and acquisitions are mandatory.

• The abstraction process becomes tedious because the definition of each entity must be evaluated for the most appropriate meaningful impact vs. academic purity. Often, star schemas do not force this process to take place.


ETL staff are frequently drawn from application integration staff. Application design is usually driven by taking a goal, quantifying it into some type of logical data design and then determining the best physical management process that supports the business process steps. When designing a 3NF warehouse, the designer first engineers the data points forward to the intended goal (by talking to the subject matter or source experts). The designer then has to reverse engineer an abstraction process to a more generic analytic model that probably differs significantly from the process model because of context. This second wave of abstraction usually confounds experienced application developers.

Key Issues
Which data warehouse designs and topologies should project managers and architects select to ensure adequate flexibility to adapt to changing business requirements?

Note 1
Normal Form Compromises

Some typical examples of the types of compromises to be made can be found in how to deploy a party entity or a facility entity. A party entity is usually deployed with a table for basic identification information, such as birth date, primary name (for example, surname and corporation name), secondary name (for example, DBA and given name) and other identifying data. For a full 3NF deployment, a primary name table could be deployed separately from the party table, as could a secondary name table. Facility can be broken into structure type, purpose, location and others. It quickly becomes apparent that 3NF can become an academic exercise without end.
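The party compromise can be illustrated by placing the two deployments side by side. The sketch below uses in-memory SQLite databases; the DDL, the column names and the sample party are assumptions, and the "separated" variant follows this note's description of a fuller 3NF deployment rather than any formal proof of normal form.

```python
import sqlite3

# Compromise deployment: names are held directly on the party row.
COMPROMISE_DDL = """
CREATE TABLE party (
    party_id INTEGER PRIMARY KEY,
    birth_date TEXT,
    primary_name TEXT NOT NULL,
    secondary_name TEXT
);
"""

# Separated deployment: each kind of name is moved to its own table.
SEPARATED_DDL = """
CREATE TABLE party (party_id INTEGER PRIMARY KEY, birth_date TEXT);
CREATE TABLE primary_name (
    party_id INTEGER PRIMARY KEY REFERENCES party(party_id),
    name TEXT NOT NULL
);
CREATE TABLE secondary_name (
    party_id INTEGER PRIMARY KEY REFERENCES party(party_id),
    name TEXT NOT NULL
);
"""


def compromise_rows():
    """Load one hypothetical party into the single-table deployment."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(COMPROMISE_DDL)
    conn.execute("INSERT INTO party VALUES (1, '1970-01-15', 'Smith', 'Jane')")
    return conn.execute(
        "SELECT party_id, birth_date, primary_name, secondary_name FROM party"
    ).fetchall()


def separated_rows():
    """Load the same party into three tables and join it back together."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SEPARATED_DDL)
    conn.execute("INSERT INTO party VALUES (1, '1970-01-15')")
    conn.execute("INSERT INTO primary_name VALUES (1, 'Smith')")
    conn.execute("INSERT INTO secondary_name VALUES (1, 'Jane')")
    return conn.execute(
        """
        SELECT p.party_id, p.birth_date, pn.name, sn.name
        FROM party AS p
        JOIN primary_name AS pn ON pn.party_id = p.party_id
        LEFT JOIN secondary_name AS sn ON sn.party_id = p.party_id
        """
    ).fetchall()
```

Both functions return the same row, but the separated form needs two joins to do so. That extra join cost for the same answer is the kind of trade-off that is weighed when deciding how far to carry the decomposition.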

Note 2
Modeling the Data Warehouse SLA

Less-experienced practitioners or those with specific agendas see the 3NF and star schema deployment approaches as diametrically opposed concepts. However, each approach has specific benefits that add to the performance and future life of the data warehouse. As with any type of technology, the benefits accrue from correct use, and the risks emerge from inappropriate use or poor design. Rather than engaging in a political struggle, it's best to address the data warehouse in terms of the service it is expected to provide to users and the organization. A data warehouse SLA must address the following key points:

• Flexibility to promote data reuse for future or undefined purposes
• Grain of detail that promotes integration with operational analytics
• Cost of ownership controls, such as a single point of maintenance
• Support for rapid integration during merger and acquisition periods
• Independence from data models described with front-end tools


