Data Warehousing Architecture - Designing the Data Staging Area

By Denise Rogers

The staging area tends to be one of the more overlooked components of a data warehouse architecture, and yet it is an integral part of the ETL component design. Learn why it is best to design the staging layer right the first time, enabling support of the various ETL processes and related methodology, recoverability and scalability.

In any data warehousing initiative, there are several common components to the architecture: the data sources and targets, the ETL framework, the infrastructure, the application layer and the data staging area. The staging area, in my experience, has to be one of the more overlooked and underestimated components of a data warehouse architecture, mostly because of a lack of understanding as to what exactly it is. A quick search through a number of websites turns up definitions describing the data staging area as simply a temporary workspace used to transform and enrich data before it flows into the operational data store (ODS) and the data warehouse. This is a good fundamental definition, but the staging area is so much more. How much more, you ask? In reality, the data staging area is an information hub that facilitates the enriching stages that data goes through in order to populate an ODS and/or data warehouse. It is the essential ingredient in the development of an approach and/or methodology for creating a comprehensive data-centric solution for any data warehousing project. If we really think about it, the data staging area is an integral part of the ETL component design and is the foundation for the ETL architecture.

The Design of the Information Hub
The data staging area has been labeled appropriately and with good reason. With any data warehousing effort, we all know that data will be transformed and consolidated from any number of disparate and heterogeneous sources. However, the design of a robust and scalable information hub is framed and scoped out by functional and non-functional requirements. Examples of these requirements include the following:

- The amount of raw source data to retain after it has been processed through the ETL data lifecycle
- Whether the server(s) that house the staging area will be dedicated or shared with other applications and environments (dedicated servers are a proven way to go)
- The acceptable levels of data quality, related baselines and metrics as stated by the Data Governance Board
- Decisions on the data sources that will be federated in and the ones that will be a copy of the sources
- The management of metadata as data sources are brought into the landing zone of the staging area
- The level of security and roles defined for each of the areas within the staging environment
- The masking/scrambling of sensitive data within staging areas
- The identification of recoverable artifacts in the event of disasters

With these types of requirements, rules and decisions, a scalable and secured framework is firmly in place to facilitate the defined ETL methodology. The data sources go through a number of evolutionary stages in order to build a robust and comprehensive data warehouse and/or ODS, and as the great data architects that we are, we know that these stages must include the following.

Data Acquisition
This process includes landing the data physically or logically in order to initiate the ETL processing lifecycle. The staging area here could include a series of sequential files, relational or federated data objects, etc. However, the design of the intake area or landing zone must enable the subsequent ETL processes, as well as provide direct links and/or integration points to the metadata repository so that appropriate entries can be made for all data sources landing in the intake area.

Data Profiling
Data profiling is the surveying of the source data landscape to gain an understanding of the condition of the data sources. In most profiling efforts, this means generating various reports with any number of metrics, statistics and counts that reflect the quality of the source data coming in.
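
To make the profiling stage more concrete, here is a minimal sketch of the kind of per-column metrics such a report might contain. It assumes a hypothetical CSV extract in the landing zone and an illustrative 95% completeness baseline; neither the file name nor the threshold comes from the article.

```python
import pandas as pd

SOURCE_FILE = "landing_zone/customers.csv"   # hypothetical extract in the landing zone
COMPLETENESS_TARGET = 0.95                   # illustrative baseline, e.g. from the Data Governance Board

def profile(path: str) -> pd.DataFrame:
    """Read every record and produce per-column quality metrics for the profiling report."""
    df = pd.read_csv(path)
    report = pd.DataFrame({
        "rows": len(df),
        "nulls": df.isna().sum(),
        "null_pct": df.isna().mean().round(4),
        "distinct": df.nunique(),
    })
    # Flag columns that fall below the agreed completeness baseline.
    report["meets_baseline"] = (1 - report["null_pct"]) >= COMPLETENESS_TARGET
    return report

if __name__ == "__main__":
    print(profile(SOURCE_FILE))
```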

Data Cleansing
Data cleansing is an iterative set of processes that starts and ends with the business rules and standards around acceptable data quality levels from the Data Governance Board (e.g., 95% of the data meets the quality standards). This includes investigative jobs to provide additional detail in detecting data patterns and design alternatives for quality enforcement at the attribute, record and aggregate levels, and data correction jobs to fill in missing or incomplete data and correct data values. There is also the analysis of reports based on the findings and results of the investigation and correction jobs to determine if further refinements and/or modifications are to be made.
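
As a sketch of how such correction jobs and quality gates might be expressed, the snippet below assumes hypothetical country and age columns and reuses the illustrative 95% threshold; it is an example pattern, not the article's prescribed implementation.

```python
import pandas as pd

QUALITY_THRESHOLD = 0.95  # e.g., 95% of records must pass, per the governance standard

def cleanse(df: pd.DataFrame) -> pd.DataFrame:
    """Correction job: fill incomplete data and correct obviously invalid values."""
    out = df.copy()
    out["country"] = out["country"].fillna("UNKNOWN")  # fill in missing data
    out["age"] = out["age"].clip(lower=0)              # correct invalid values
    return out

def passes_quality_gate(df: pd.DataFrame) -> bool:
    """Record-level rule: a record is acceptable only if none of its fields are null."""
    good_ratio = df.notna().all(axis=1).mean()
    return good_ratio >= QUALITY_THRESHOLD
```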

Data Standardization and Matching
Data standardization and matching is a set of processes that primarily consists of the design and execution of standardizing jobs to create uniformity around specific mandatory data elements, along with matching and de-duplicating jobs to eliminate duplicate data and create a single version of the truth. It also includes the analysis of reports related to errors and/or exceptions to determine if further refinements or modifications are to be made (if required) and to assess the readiness for data delivery to the data warehouse and ODS.
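
A minimal sketch of a standardizing job followed by a matching/de-duplicating job is shown below; the field names, the business key and the exact-match rule are illustrative assumptions (production matching is usually fuzzier).

```python
import pandas as pd

def standardize(df: pd.DataFrame) -> pd.DataFrame:
    """Create uniformity around mandatory data elements (illustrative fields)."""
    out = df.copy()
    out["last_name"] = out["last_name"].str.strip().str.upper()
    out["phone"] = out["phone"].str.replace(r"\D", "", regex=True)  # keep digits only
    return out

def match_and_dedupe(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse records sharing a business key into a single version of the truth."""
    business_key = ["last_name", "phone"]
    return df.sort_values("last_updated").drop_duplicates(subset=business_key, keep="last")
```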

Data Transformation
Transforming data essentially means converting data to conform to a standard established by the Data Governance Board. Examples of data transformations include converting nulls to specific values, mapping disparate gender codes to a common set of values, or even merging multiple source fields into one data element.
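
Those examples translate almost directly into code. The sketch below assumes hypothetical source columns (middle_initial, gender, first_name, last_name) and an illustrative code mapping rather than the article's actual rule set.

```python
import pandas as pd

# Illustrative standard: disparate source codes map to one common set of values.
GENDER_MAP = {"M": "MALE", "MALE": "MALE", "1": "MALE",
              "F": "FEMALE", "FEMALE": "FEMALE", "2": "FEMALE"}

def transform(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Convert nulls to a specific value.
    out["middle_initial"] = out["middle_initial"].fillna("N/A")
    # Map disparate gender codes to the common set; unrecognized codes become UNKNOWN.
    out["gender"] = out["gender"].astype(str).str.upper().map(GENDER_MAP).fillna("UNKNOWN")
    # Merge multiple source fields into one data element.
    out["full_name"] = out["first_name"].str.strip() + " " + out["last_name"].str.strip()
    return out
```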

Data Loading
Depending on business requirements, the loading phase can include a total data refresh of the target component or the addition of new data to the target in a historical manner. Loading to a staged copy of the target component enables a series of validation exercises, including verification of referential integrity, data quality and transformation rules prior to the actual data population of the DW and/or ODS.
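
Here is a sketch of the two loading patterns against a staged copy of the target, using SQLite and hypothetical staging table names purely for illustration.

```python
import sqlite3
import pandas as pd

def load(df: pd.DataFrame, conn: sqlite3.Connection, full_refresh: bool) -> None:
    """Load transformed data into a staged copy of the target table."""
    if full_refresh:
        # Total data refresh of the staged target component.
        df.to_sql("stg_customer", conn, if_exists="replace", index=False)
    else:
        # Add new data in a historical (append-only) manner.
        df.to_sql("stg_customer", conn, if_exists="append", index=False)

def referential_integrity_ok(conn: sqlite3.Connection) -> bool:
    """Example validation before populating the DW/ODS: no orphaned region keys."""
    orphans = conn.execute(
        "SELECT COUNT(*) FROM stg_customer c "
        "LEFT JOIN stg_region r ON c.region_id = r.region_id "
        "WHERE r.region_id IS NULL"
    ).fetchone()[0]
    return orphans == 0
```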

Design and Construction
The creation of a staging area will usually start with the typical activities of the design of any data environment. Tasks such as server configuration, alignment of file systems, and creating the database instances and related database objects are common elements in the design of any infrastructure dedicated to a data environment. However, there are a number of unique tasks that need to be completed to align the staging area to the ETL methodology discussed in the prior sections of this article.

For starters, the data architect and the DBA will need to create separate environments for each stage that the data goes through. This means separate databases and file systems that are dedicated to the stage of the ETL lifecycle. For example, a dedicated database instance and related file systems should be created for the data acquisition and profiling stages, because the processes involved in the profiling effort will use tremendous amounts of memory and CPU and should be segregated so that other workloads are not adversely impacted. Also, the file systems allocated to the containers that the database uses should be separate from the file systems used in the data acquisition process so that there are no I/O bottleneck issues. The design of the database instance must also take into consideration that, with the use of federated data, there may be implications at the database level that cause ripple effects on the other data objects within the database instance.
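
To show what "separate environments per stage" could look like in practice, here is a purely illustrative layout; the schema names and mount points are assumptions, and real choices will depend on the platform and DBA standards.

```python
# Illustrative only: one dedicated schema and set of file systems per ETL stage, so that
# profiling workloads and acquisition I/O never compete for the same resources.
STAGE_ENVIRONMENTS = {
    "acquisition":     {"schema": "STG_ACQ",   "data_fs": "/stg/acq/data",   "work_fs": "/stg/acq/work"},
    "profiling":       {"schema": "STG_PROF",  "data_fs": "/stg/prof/data",  "work_fs": "/stg/prof/work"},
    "cleansing":       {"schema": "STG_CLEAN", "data_fs": "/stg/clean/data", "work_fs": "/stg/clean/work"},
    "standardization": {"schema": "STG_STD",   "data_fs": "/stg/std/data",   "work_fs": "/stg/std/work"},
    "loading":         {"schema": "STG_LOAD",  "data_fs": "/stg/load/data",  "work_fs": "/stg/load/work"},
}

def environment_for(stage: str) -> dict:
    """Look up the dedicated database schema and file systems for a given ETL stage."""
    return STAGE_ENVIRONMENTS[stage]
```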

Then there is the SECURITY component! This is live production data that contains highly sensitive information. This data cannot be masked and/or scrambled, as that defeats the whole purpose of the ETL process to stage data into the data warehouse or ODS; the raw data must be exposed in order for the ETL to be effective in integrating, cleansing and standardizing all data from all sources. Therefore, having a robust security framework is an essential ingredient in this configuration. Typically, the data steward and an appointed business analyst should be among the chosen few who have access to some of the sensitive data elements. The ETL developer, DBA and system administrator do not need to see any of it. There is also the prevention of copying data: no one should be allowed to make copies of anything for any purpose. The information hub should be able to satisfy all requests for data access for analysis in a robustly secured environment.

In other words, a well-designed staging area should enable the ETL approach, processes and services, facilitate the data management activities of business analysts and data stewards (validation of business rules, profiling reports, quality reports), and successfully stage the data required to populate the data warehouse and the operational data store.

The Information Hub Experience - Tales from the Data Layer
I was assigned to the first data warehouse project at a major healthcare company. It was our first time working with an ETL solution and all that comes with it. We successfully installed the toolset and created the protocols to pull in the data sources and target data warehouse components. The staging layer was the sum total of several file systems allocated for ETL usage, and not much else was in place at the staging area level. In other words, we built a flimsy foundation for the ETL component and we paid dearly for it! Why? Because whenever the ETL processes aborted or there were hardware failures, there were no clean ways to restart anything. It was an extremely painful project.

At another time, I worked at another client site as part of a team to design and construct a data warehouse environment complete with an ETL solution. This time, having grown from that experience, I knew I would get it right! I created an information hub that had file systems and a database, tables and views. This database had federated objects and every kind of bell and whistle you could think of. Except that during the data profiling of the federated objects, the process ran out of temporary space at the source application and aborted. The error message generated was that the database was corrupt and all was lost. Talk about the panic! I had that look in my eyes! Everything ground to a screeching halt while I completed the database recovery.

The lessons here are to design the staging layer to enable support of the various ETL processes and related methodology, recoverability and scalability. Failure to do that will lead to many sleepless nights, days spent in war rooms and putting the data warehouse project in jeopardy of not meeting milestones and deadlines. I have been on both sides and, not being a big fan of war rooms, I now know better. You should too!