of data in support of management's decision-making process".

By Historical we mean that data is continuously collected from sources and loaded into the warehouse, and previously loaded data is not deleted for a long period of time. This results in building up historical data in the warehouse.
By Subject Oriented we mean data grouped by a particular business area rather than the business as a whole.
By Integrated we mean collecting and merging data from various sources. These sources could be disparate in nature.
By Time-variant we mean that all data in the data warehouse is identified with a particular time period.
By Non-volatile we mean that data loaded into the warehouse is based on business transactions in the past, and hence is not expected to change over time.

Data Warehousing (DWH)
Data warehousing is the vast field that includes the processes and methodologies involved in creating, populating, querying and maintaining a data warehouse. The objective of DWH is to help users make better business decisions.

Enterprise Data Warehouse (EDW)
An Enterprise Data Warehouse (EDW) is a central, normalized repository containing data for multiple subject areas of an organization. EDWs are generally used to feed subject-area-specific data marts. We should consider building an EDW when:
1. We have clarity on the requirements of multiple business processes.
2. We want a centralized architecture catering to a single version of the truth.

Data Mart
When we restrict the scope of a data warehouse to a subset of business processes or a single subject area, we call it a Data Mart. Data marts can be stand-alone entities, be built from warehouses, or be used to build a warehouse.
1. Stand-Alone Data Marts – These are single-subject-area data warehouses. Stand-alone data marts can lead to many stove-pipe/silo data marts in the future.
2. Data Marts fed from an Enterprise Data Warehouse (Top-Down) – This is the top-down strategy suggested by Bill Inmon.
Here we build an enterprise data warehouse (EDW) hosting data for multiple subject areas, and from the EDW we build subject-area-specific data marts.
3. Data Marts building Data Warehouses (Bottom-Up) – This is the bottom-up strategy suggested by Ralph Kimball. Here we build subject-area-specific data marts first, keeping in mind the shared business attributes (called conformed dimensions). Using the conformed dimensions we build a warehouse.

Legacy Systems
Legacy system is the term given to a system having very large storage capabilities. Mainframes and AS400 machines are two examples of legacy systems. Applications running on legacy systems are generally OLTP systems. Due to their large storage capabilities, the business transactional data from legacy applications and other OLTP systems is stored on them and maintained for a long period of time. In the world of DWH, legacy systems are therefore often used to extract past historical business data.

Online Transaction Processing Systems (OLTP)
Online Transaction Processing Systems (OLTP) are transactional systems that perform the day-to-day business operations; in other words, OLTP systems are the automated day-to-day business processes. They contain the current valued business data, which is updated based on business transactions. In the world of DWH, they are usually one of the types of source systems from which the day-to-day business transactional data is extracted.

Data Profiling
This is a very important activity, mostly performed while collecting the requirements and during the system study. Data profiling is a process that familiarizes you with the data you will be loading into the warehouse/mart. In its most basic form, a data profiling process reads the source data and reports on:
1. The type of data – whether the data is Numeric, Text, Date etc.
2. Data statistics – minimum and maximum values, most frequently occurring values.
3. Data anomalies – junk values, invalid characters, values outside a given range, incorrect business codes.
Using data profiling we can validate the business requirements with respect to the business codes and attributes present in the sources, and define exceptional cases for when we receive incorrect or inconsistent data. A lot of data profiling tools are available in the market; some examples are Trillium Software and IBM WebSphere Data Integrator's Profile Stage. We can even hand-code (program) data profiling applications specific to a project; programming languages like Perl, Java, PL/SQL and shell scripting can be very handy in creating such applications.

Data Modeling
This activity is performed once we have understood the requirements for the data warehouse/mart. Data modeling refers to creating the structure of the data and its relationships according to the business rules defined in the requirements. We create the conceptual model, followed by the logical model, followed by the physical model. We see a progression in the level of detail as we proceed toward the physical model.
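As an illustration of the data profiling activity described above, the sketch below (plain Python; the CSV layout, the type-inference rule and the junk-value check are assumptions made for this example, not features of any particular profiling tool) reads a source extract and reports the type of data, basic statistics and anomalies for each column:

```python
import csv
from collections import Counter

def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

def profile_column(name, values):
    """Summarize one column: inferred type, null count, most frequent value, min/max."""
    non_empty = [v.strip() for v in values if v.strip()]
    numeric = bool(non_empty) and all(is_number(v) for v in non_empty)
    stats = {
        "column": name,
        "type": "Numeric" if numeric else "Text",
        "nulls": len(values) - len(non_empty),  # missing values are one kind of anomaly
        "most_common": Counter(non_empty).most_common(1)[0][0] if non_empty else None,
        "junk": [v for v in non_empty if not v.isprintable()],  # example junk-value rule
    }
    if numeric:
        nums = [float(v) for v in non_empty]
        stats["min"], stats["max"] = min(nums), max(nums)
    return stats

def profile_csv(path):
    """Profile every column of a delimited source extract."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    columns = rows[0].keys() if rows else []
    return [profile_column(col, [r[col] for r in rows]) for col in columns]
```

A report like this, run while collecting requirements, surfaces incorrect business codes and out-of-range values before the ETL design is finalized.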
Conceptual Model
The conceptual model identifies, at a high level, the subject areas involved and the relationships between the subject areas. The objective of creating the conceptual model is to identify the scope of the project in terms of the subject areas involved.

Logical Model
The logical model identifies the details of the subject areas identified in the conceptual model. It identifies the entities within each subject area and documents in detail the definition of each entity, its attributes, candidate keys, data types, and its relationships with other entities.

Physical Model
The physical model identifies the table structures: the columns within the tables, primary keys, the data types and sizes of the columns, the constraints on the tables, and the privileges on the tables. Using the physical model we develop the database objects through the DDL scripts produced from the physical model.

Normalization
This is a technique used in data modeling that emphasizes avoiding storing the same data element in multiple places. We follow the three rules of normalization, called the First, Second and Third Normal Forms, to achieve a normalized data model.
First Normal Form – The attributes of an entity must be atomic and must depend on the key.
Second Normal Form – Every attribute must depend on the whole key.
Third Normal Form – Every attribute must depend on nothing but the key.
Theoretically we have further rules, called the Boyce-Codd Normal Form, the Fourth Normal Form and the Fifth Normal Form; in practice we do not use the rules beyond 3NF. A normalized data model may result in many tables/entities with multiple levels of relationships, for example table1 related to table2, table2 further related to table3, table3 related to table4 and so on.

Dimensional Modeling
This is the form of data modeling used in data warehousing, where we store business attributes in tables called Dimension Tables and business metrics/measures in tables called Fact Tables. The business metrics are associated with the business attributes by relating the dimension tables to the fact table. A fact table can be related to many dimension tables, and hence forms the center of the structure when graphically represented; the dimension tables are not related to one another. This structure is also called the star schema.
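A minimal star schema can be sketched directly in SQL. The example below uses Python's built-in sqlite3 module; the table and column names (dim_product, fact_sales, etc.) are illustrative choices, not taken from the text:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dimension tables hold the descriptive business attributes.
conn.execute("CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT)")
conn.execute("CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, calendar_date TEXT)")

# The fact table holds the numeric measures plus one foreign key per dimension;
# it sits at the center of the star.
conn.execute("""CREATE TABLE fact_sales (
    product_key  INTEGER REFERENCES dim_product(product_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    sales_amount REAL)""")

conn.executemany("INSERT INTO dim_product VALUES (?, ?)", [(1, "Widget"), (2, "Gadget")])
conn.execute("INSERT INTO dim_date VALUES (20240101, '2024-01-01')")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(1, 20240101, 100.0), (2, 20240101, 50.0), (1, 20240101, 25.0)])

# A typical dimensional query: join the fact table to a dimension and
# aggregate the measure by a business attribute.
totals = conn.execute("""
    SELECT p.product_name, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.product_name
    ORDER BY p.product_name
""").fetchall()
```

Note that the query needs only a single level of joins from the fact table out to each dimension, which is exactly the access-speed argument for the star schema.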
Please note that in dimensional modeling we do not emphasize normalizing the data; instead we keep fewer levels of relationships.

Extract Transform Load (ETL)
ETL covers the back-end processes involved in collecting data from various sources, preparing the data with respect to the requirements, and loading it into the warehouse/mart.
Extraction – This is the process where we select the data from the various source systems and transport it to the ETL server. Source systems could range from one to many in number, and be similar or completely disparate in nature.
Transformation – This is the process where we apply the business rules to the source data before loading it into the warehouse/mart. Transformation, as the term suggests, converts the data based on the business rules, and also adds further business context to the data. The types of transformations we generally use in ETL are sorting data, merging data, comparing data, generating running numbers, changing data types, etc.
Cleansing – Once the data has been extracted, we need to check whether it is valid and ensure consistency when it comes from disparate sources. We check for valid values by applying ETL data validation rules; the rules could check for correct data formats (Numeric, Date, Text), data within a given range, correct business codes, and junk values. We ensure consistency of data collected from disparate sources by making sure that the same business attribute has the same representation. For example, the gender of a customer may be represented as "m" and "f" in one source system and as "male" and "female" in another. In this case we make sure we convert "male" to "m" and "female" to "f", so that the gender coming from the second source system is consistent with that from the first.
Loading – Once the source data has been transformed by applying the business rules, we load it into the warehouse/mart. There are two types of loads we perform in data warehousing:
1. History Load / Bulk Load – This is the process of loading the historical data covering a long period of time into the warehouse. It is usually a one-time activity in which we take the historical business data and bulk load it into the warehouse/mart.
2. Incremental Load – This follows the history load. Based on the availability of the current source data, we load it into the warehouse. This is an ongoing process, performed periodically based on the availability of source data (e.g. daily, weekly, monthly).

Operational Data Store (ODS)
An Operational Data Store (ODS) is a repository layer that resides between the data warehouse and the source systems. The current, detailed data from the disparate source systems is integrated and grouped subject-area-wise in the ODS. The ODS is updated more frequently than a data warehouse; the updates happen shortly after the source system data changes. There are two objectives of an ODS:
1. Provide the organization with operational integration.
2. Source the DWH with detailed, integrated data.

Online Analytical Processing (OLAP)
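The cleansing rule for the gender example above can be sketched as follows (plain Python; the record layout, the amount field and the reject handling are assumptions made for the illustration):

```python
# Standardize the gender code so that "male"/"female" from one source system
# matches the "m"/"f" representation used by another source system.
GENDER_MAP = {"m": "m", "f": "f", "male": "m", "female": "f"}

def transform(record):
    """Apply transformation and cleansing rules to one extracted record."""
    out = dict(record)
    gender = out.get("gender", "").strip().lower()
    if gender not in GENDER_MAP:
        raise ValueError(f"invalid gender code in {record!r}")  # cleansing: reject bad codes
    out["gender"] = GENDER_MAP[gender]
    out["amount"] = float(out["amount"])  # change data type, another common transformation
    return out

def run(records):
    """Split a batch into clean records and rejects kept aside for reprocessing."""
    clean, rejects = [], []
    for rec in records:
        try:
            clean.append(transform(rec))
        except (ValueError, KeyError):
            rejects.append(rec)
    return clean, rejects
```

Routing the failures to a reject collection, rather than dropping them, is what later makes reject-data reprocessing possible.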
Online Analytical Processing (OLAP) is the approach involved in providing analysis of multidimensional data. Conceptually, the data can be considered a multi-dimensional graph having various axes. Each axis represents a category of business called a dimension (e.g. location, product, time, employee). The data points are the business transactional values, called measures, and the coordinates of each data point relate it to the business dimensions. There are various forms of OLAP, namely:
Multidimensional OLAP (MOLAP) – Uses a cube structure to store the business attributes and measures. The cubes are generally stored in a proprietary format.
Relational OLAP (ROLAP) – Uses a database to store the business categories as dimension tables (containing textual business attributes and codes) and the business transactional values (numeric measures) in fact tables. The fact tables are connected to the dimension tables through foreign-key relationships.
Hybrid OLAP (HOLAP) – Uses a cube structure to store some business attributes and database tables to hold the others.

Business Intelligence (BI)
Business Intelligence (BI) is a generic term for the processes, methodologies and technologies used for collecting, analyzing and presenting business information. The objective of BI is to support users in making better business decisions; the objectives of DWH and BI go hand-in-hand, both enabling users to make better business decisions. BI tools are generally accompanied by reporting features like slice-and-dice, pivot-table analysis, visualization, and statistical functions that make the decision-making process even more sophisticated and advanced. Data marts/warehouses provide easily accessible historical data to the BI tools, giving end users forecasting/predictive analytical capabilities and historical business patterns and trends.

Decision Support Systems (DSS)
A Decision Support System (DSS) is used for management decision-making reporting. It is basically an automated system that gathers business data and helps management make business decisions.

Data Mining
Data mining, in the broad sense, deals with extracting relevant information from very large amounts of data and making intelligent decisions based on the information extracted.

Building Blocks of Data Warehousing
If you had to build a data warehouse, you would need the following components:
Source System(s) – A source system is generally a collection of OLTP systems and/or legacy systems that contain the day-to-day business data.
ETL Server – These are generally Symmetric Multi-Processing (SMP) machines or Massively Parallel Processing (MPP) machines.
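Conceptually, the cube operations behind slice-and-dice can be sketched with an ordinary dictionary standing in for the cube; the dimensions and figures below are invented for the illustration:

```python
from collections import defaultdict

# Measures keyed by two dimensions, (location, product); in MOLAP these cells
# would live inside the proprietary cube structure.
cube = {
    ("east", "widget"): 10, ("east", "gadget"): 5,
    ("west", "widget"): 7,  ("west", "gadget"): 3,
}

def slice_cube(cube, location):
    """Slice: fix one member of the location dimension and keep the rest."""
    return {product: v for (loc, product), v in cube.items() if loc == location}

def rollup(cube, axis):
    """Aggregate the measure along one axis (0 = location, 1 = product)."""
    totals = defaultdict(int)
    for key, value in cube.items():
        totals[key[axis]] += value
    return dict(totals)
```

Slicing answers "what happened in the east?", while a rollup along the product axis answers "how did each product do across all locations?".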
Database Server – Usually the database resides on the ETL server itself. In the case of very large amounts of data, we may want a separate SMP or MPP machine, clustered to exploit the parallel processing capabilities of the RDBMS.
ETL Tool – These are sophisticated GUI-based applications that carry out the work of extracting data from the source systems, transforming and cleansing the data, and loading it into the databases. The ETL tool resides on the ETL server.
Landing Zone – Usually a file system, and sometimes a database, used to land the source data arriving from the various source systems. Generally the data in the landing zone is in raw format.
Staging Area – Usually a file system and a database used to house the transformed and cleansed source data before it is loaded into the warehouse or mart. The format of the data in the staging area is similar to that of the warehouse or mart; generally we avoid any further cleansing and transformation after data has been loaded into the staging area.
Data Warehouse – A database housing the normalized schema (top-down) or the conformed star schemas (bottom-up).
Reporting Server – These are generally Symmetric Multi-Processing machines or Massively Parallel Processing machines.
Reporting Tool – A sophisticated GUI-based application that enables users to view, create and schedule reports on the data warehouse.
Metadata Management Tool – Metadata is defined as the description of the data stored in your system. In data warehousing, the reporting tool, the database and the ETL tool each have their own metadata, but if we look at the overall picture, we need metadata defined all the way from the source systems through the landing zone, staging area, ETL, warehouse, mart and reports. An integrated metadata definition system containing the metadata definitions from the sources through to the reports is advisable.

All data warehouse applications have many logical layers that are separate from one another; each layer interacts with the other layers by passing along the required data. The data warehouse architecture defines the components and the structure of the data warehouse, and also defines the interfaces between the components and the external environment.
Source Layer
This layer contains all the separate OLTP and legacy systems that provide the business data. These systems are physically separate from each other, and the layer could contain data from different operating systems, databases and files.

Landing Area
This is the layer where the source data is temporarily stored on the ETL server before being transformed. We could store the source data in files or load it into database tables; generally we store it in files. Please note that adding a database at the landing area adds the cost of extra infrastructure and maintenance. We could still use a database to house the source data in order to provide a sophisticated and fast way of handling reject-data reprocessing (one of the DWH exception handling processes) for a DWH with a very low tolerance level.

Staging Area
This is the layer where the cleansed and transformed data is temporarily stored. We generally use files to store the intermediate data during transformation and cleansing; once the data is ready to be loaded into the warehouse, we load it into the staging database. The advantage of using a staging database is that we add a point in the ETL flow from which we can restart the load. The other advantages are that we can directly utilize the bulk-load utilities provided by the databases and ETL tools while loading the data into the warehouse/mart, and that we gain a point in the data flow where we can audit the data.
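The landing-to-staging-to-warehouse flow described above can be sketched as below (Python with the built-in sqlite3 module; the file layout, cleansing rules and table name are assumptions for the example). The point of the separate stage step is that a failed load can be re-run from the staging file without going back to the sources:

```python
import csv
import sqlite3

def stage(landing_path, staging_path):
    """Transform raw landed data and persist it in warehouse-ready format."""
    with open(landing_path, newline="") as src, open(staging_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for name, amount in csv.reader(src):
            # Cleanse and convert while writing the staging copy.
            writer.writerow([name.strip().upper(), float(amount)])

def load(staging_path, conn):
    """Load staged data into the warehouse table; restartable from the staging file."""
    with open(staging_path, newline="") as f:
        conn.executemany("INSERT INTO sales VALUES (?, ?)",
                         ((name, float(amount)) for name, amount in csv.reader(f)))
    conn.commit()
```

In a real pipeline each step would also log its outcome, so the scheduler knows whether to restart from extraction or only from the staged copy.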
Warehouse/Mart Layer
This is the layer where we store the transformed data. The data here is used for business analysis.

Presentation Layer
This layer contains the tools and processes used to read the data from the warehouse/mart and present it to the end user in a graphical format.

The DWH architecture can be affected by various factors, namely the requirements, the geographical location of the source systems, and the budget.

Requirements impact on DWH architecture
In case the business is not clear about the overall requirements for all or multiple subject areas, it is safer to understand the requirements for a single subject area and develop the data mart for that subject area. This leads to a standalone data mart. In the future, once the requirements for another subject area become clear, we tend to build another standalone data mart, and so on; this creates multiple standalone data marts. Multiple standalone data marts would have the same data loaded multiple times, and the same data may not be consistent across the marts. This is called a stovepipe or siloed structure. Care must be taken when we build the first data mart to avoid a stovepipe architecture later, when the requirements for the other subject areas become clear. We do so by designing dimension tables in the data mart that can be shared with other data marts; these dimension tables are called conformed dimension tables. Using the conformed dimensions we can integrate multiple data marts to form a data warehouse. This approach is called the bottom-up approach, suggested by Ralph Kimball.
In the case when the overall business requirements are clear for all subject areas, we first build an Enterprise Data Warehouse (EDW), which is a third normal form (3NF) schema, and from the EDW we build the various subject-specific data marts. This approach is called the top-down approach, suggested by Bill Inmon.

Geographical location of the source systems
The source systems could be similar in operations but geographically distributed, as in an organization operating globally, or they could differ in their operations, like financial systems, merchandizing systems, Human Resource systems etc. They could also be on differing platforms, like Mainframes, Unix systems and Windows-based systems. In the case of business operational systems (source systems) located at different geographical locations, we have two options:
1. Extract data from the various locations, collect it at a central location, and build the mart/warehouse from the data gathered at the central location. This is called the Centralized architecture.
2. Build local warehouses/marts and use the local warehouse/mart data to build the global warehouse. Here we perform the ETL operations at both the local and the corporate locations. This is called the Federated, or Distributed, architecture.

Budget
Each layer in the DWH could mean more hardware, more software licenses and eventually higher maintenance cost. When the business is conservative in its spending, we may not want to go for databases in the landing area and staging area. This would definitely be more cost effective, but we would lose the advantages of having a database in the landing and staging areas.

Types of Source Systems
OLTP Systems (On-Line Transaction Processing) – Transactional systems that perform the day-to-day business operations; in other words, OLTP systems are the automated day-to-day business processes. A source system is generally a collection of operational OLTP systems and/or legacy systems that contain the day-to-day business data.
OLTP systems contain the current valued business data, which is updated based on business transactions. These are usually web-based applications or legacy applications.
Legacy Systems – Operational systems having very large storage capabilities. Mainframes and AS400 machines are two examples of legacy systems. Applications running on legacy systems are generally OLTP systems. Due to their large storage capabilities, the business transactional data from legacy applications and other OLTP systems is stored on them and maintained for a long period of time; in the world of DWH, legacy systems are often used to extract past historical business data.
ERP Systems (Enterprise Resource Planning) – ERP systems are integrated enterprise solutions integrating the various business functions, such as financials, accounting, human resources and inventory, on a single database with a common framework. ERP systems contain extremely large sets of complex, integrated tables within their framework. Examples of ERP systems are SAP, PeopleSoft and JD Edwards.

Characteristics of Source Systems
1. They could be many in number.
2. They could be geographically distributed, having differing time zones.
3. They could be on differing platforms, like Mainframes, Unix and Windows.
4. They could be applications catering to differing business operations.
5. They could be within the operations of the organization or external to them.
6. The availability of the data from the source systems could vary in the time the data is ready: some source systems provide data at the end of the business day, some at different times due to the geographical distribution of the sources.
7. The periodicity of feeding data to the warehouse/mart could vary from source to source: some sources provide data on a daily basis, some on a weekly basis, some on a monthly basis, etc.

There are two strategies to build a data warehouse, namely the Top-Down Approach (suggested by Bill Inmon) and the Bottom-Up Approach (suggested by Ralph Kimball).
Top-Down Approach (Suggested by Bill Inmon)
In the top-down approach suggested by Bill Inmon, we build a centralized repository to house corporate-wide business data. This repository is called the Enterprise Data Warehouse (EDW). The data in the EDW is stored in a normalized form in order to avoid redundancy, and at the most detailed level. The reasons to build the EDW at the most detailed level are:
1. Flexibility to cater for future requirements.
2. Flexibility to be used by multiple departments.
The disadvantages of storing data at the detailed level are:
1. It takes a large amount of space to store data at the detailed level.
2. The complexity of the design increases with an increasing level of detail.
The central repository for corporate-wide data helps us maintain one version of the truth of the data. This is very important for the data to be reliable and consistent across subject areas, and for reconciliation in case of data-related contention between subject areas.
Once the EDW is implemented, we start building the subject-area-specific data marts, which contain data in a denormalized form, also called a star schema. The reason to denormalize the data in the mart is to provide faster access to the data for the end users' analytics. If we were to query a normalized schema for the same analytics, we would end up with complex multiple-level joins that would be much slower than the equivalent query on the denormalized schema. The data in the marts is usually summarized based on the end users' analytical requirements.
The advantage of using the top-down approach is that we build a centralized repository catering for one version of the truth of the business data. The disadvantage is that it requires more time and initial investment, and hence increased cost: the business has to wait for the EDW to be implemented, followed by the building of the data marts, before it can access its reports.
We should implement the top-down approach when:
1. The business has complete clarity on the DWH requirements for all or multiple subject areas.
2. The business is ready to invest considerable time and money.

Bottom-Up Approach (Suggested by Ralph Kimball)
The bottom-up approach suggested by Ralph Kimball is an incremental approach to building a data warehouse. Here we build the data marts separately, at different points in time, as and when the specific subject area requirements become clear, and the data marts are then integrated or combined to form a data warehouse. Separate data marts are combined through the use of conformed dimensions and conformed facts. A conformed dimension and a conformed fact are ones that can be shared across data marts. A conformed dimension means exactly the same thing with every fact table it is joined to: it has consistent dimension keys, consistent attribute names and consistent values across the separate data marts. A conformed fact has the same definition of its measures, the same dimensions joined to it, and the same granularity across data marts.
In order to build conformed dimensions and facts, we need to create a Bus Matrix, with the rows corresponding to the various data marts and the columns corresponding to all the dimension tables, as depicted in the diagram above. Care must be taken to build an exhaustive bus matrix from the first data mart itself, identifying all possible dimensions; otherwise we would be building stove-pipes in the organization.
The bottom-up approach helps us incrementally build the warehouse by developing and integrating data marts as and when the requirements become clear. The advantages of using the bottom-up approach are that it does not require a high initial cost and it has a faster implementation time; we do not have to wait until the overall requirements of the warehouse are known, hence the business can start using the marts much earlier than with the top-down approach.
The disadvantages of using the bottom-up approach are that it stores data in a denormalized format, hence there would be high space usage for the detailed data, and that we have a tendency not to keep detailed data in this approach, hence losing the advantage of having detailed data, i.e. the flexibility to easily cater for future requirements.
We should implement the bottom-up approach when:
1. The complete warehouse requirements are not clear and we have clarity on only one data mart.
2. We have initial cost and time constraints.
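The bus matrix itself is just a grid, and a minimal way to sketch it is a mapping from each planned mart to the dimensions its facts will join to (the mart and dimension names below are invented for the example):

```python
# Rows of the bus matrix are data marts, columns are dimension tables.
BUS_MATRIX = {
    "sales_mart":     {"dim_date", "dim_product", "dim_store", "dim_customer"},
    "inventory_mart": {"dim_date", "dim_product", "dim_store"},
    "shipping_mart":  {"dim_date", "dim_product", "dim_customer", "dim_carrier"},
}

def conformed_dimensions(matrix):
    """Dimensions used by more than one mart must be conformed: built once and shared."""
    marts = list(matrix.values())
    return {dim for dims in marts for dim in dims
            if sum(dim in mart for mart in marts) > 1}
```

Checking the first mart's design against such a matrix is one way to catch a dimension that is about to be built as a private, stove-pipe copy rather than shared.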