Chapter 1: Introduction to Data Warehousing (General Terminology)

a. What is a Data Warehouse? A data warehouse is a system that retrieves and consolidates data periodically from one or more source systems into either a dimensional or a normalized data store. It keeps years of history and is queried for business intelligence or other analytical activities.
b. A data warehouse is usually updated in batches, not every time a transaction happens in the source system.
c. The source systems are the Online Transaction Processing (OLTP) systems that contain the data you want to load into the DW. An OLTP system is dedicated to capturing and storing business transactions.
d. The source system data is examined using a data profiler to understand the characteristics of the data.
e. The data profiler is a tool that can analyze data, for example finding the number of rows in a table, the number of rows that contain null values, and so on (see the sketch after this list).
f. The ETL system brings data from the various source systems into a staging area and then integrates, transforms, and loads the data into a dimensional data store.
g. A Dimensional Data Store (DDS) is a database that stores the data warehouse data in a different format than the OLTP systems.
h. There are two reasons for getting the data from the source systems into the DDS and then querying the DDS instead of querying the source systems directly. First, in a DDS the data is arranged in a dimensional format that is more suitable for analysis. Second, a DDS contains integrated data from several source systems.
i. When the ETL system loads the data into the DDS, data quality rules perform various data quality checks. Bad data is put into the data quality (DQ) database to be reported and then corrected in the source systems.
j. The ETL system is managed and orchestrated by the control system, based on the sequence, rules, and logic stored in the metadata.
k. The metadata is a database containing information about the data structure, the data meaning, the data usage, the data quality rules, and other information about the data.
l. The audit system logs the system operations and usage into the metadata database. It is part of the ETL system; it monitors the operational activities of the ETL processes and logs their operational statistics, and it is used to understand what happened during the ETL process.
m. Some applications operate on a multidimensional database format. For these applications, the data in the DDS is loaded into multidimensional databases (MDBs), also known as cubes. A multidimensional database is a form of database where data is stored in cells and the position of each cell is defined by a number of variables called dimensions. Each cell represents a business event, and the values of the dimensions show when and where this event happened; for example, a cell can represent an event where a customer buys something from a store at a particular time.
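The kind of check described in item e can be sketched in a few lines. The sketch below is hypothetical and only illustrates the statistics a data profiler typically reports (row counts, null counts, distinct counts); it assumes a pandas DataFrame holding a sample of a source table and does not reflect any particular profiling product's API.

```python
# Hypothetical data-profiling sketch: summarize row counts, null counts,
# and distinct counts for a sample of a source table (assumed columns).
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column row count, null count, and distinct count."""
    return pd.DataFrame({
        "rows": len(df),                      # total rows in the sample
        "nulls": df.isna().sum(),             # nulls per column
        "distinct": df.nunique(dropna=True),  # distinct non-null values
    })

# Example source sample (column names are assumptions)
sample = pd.DataFrame({
    "customer_id": [1, 2, 2, None],
    "country": ["UK", None, "US", "US"],
})
print(profile(sample))
```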

n. Tools such as analytics applications, data mining, scorecards, dashboards, multidimensional reporting tools, and other BI tools can retrieve data interactively from multidimensional databases. They retrieve the data to produce various features and results on the front-end screens that enable the users to get a deeper understanding of their businesses.
o. An example of an analytic application is analyzing sales by time, customer, and product. The users can analyze the amount of sales for a certain month, region, and product type.
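A minimal sketch of the analysis in item o, assuming a small sales table with hypothetical column names and using pandas rather than a cube-aware BI tool:

```python
# Hypothetical sketch of item o: slicing sales by month, region, and
# product type. Column names are assumptions for illustration only.
import pandas as pd

sales = pd.DataFrame({
    "month":        ["2024-01", "2024-01", "2024-02"],
    "region":       ["North", "South", "North"],
    "product_type": ["Book", "Music", "Book"],
    "amount":       [120.0, 80.0, 200.0],
})

# Total sales amount per month, region, and product type.
summary = sales.pivot_table(
    index="month", columns=["region", "product_type"],
    values="amount", aggfunc="sum", fill_value=0,
)
print(summary)
```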

Retrieve Data:
a. The ETL system consists of a set of processes that retrieve data from the source systems, transform the data, and load the data into a target system.
b. Most ETL systems also have mechanisms to clean the data from the source system before putting it into the warehouse.
c. Data cleansing is the process of identifying and correcting dirty data from source systems.
d. This is implemented using data quality rules that define what dirty data is.
e. After data is extracted from the source systems, but before it is loaded into the warehouse, the data is examined using these rules.
f. If the data quality rules determine that the data is correct, the data is loaded into the warehouse. If the data quality rules determine that the data is incorrect, there are three options: it can be rejected, corrected, or allowed to be loaded into the warehouse (see the sketch after this list). Which action is appropriate for a particular piece of data depends on the situation, the risk level, the rule type (error or warning), and so on. Data cleansing will be discussed in more detail in Chapter 9.

Consolidate Data: A data warehouse consolidates data from many transactional systems. The key difference between a data warehouse and a front-office transactional system is that the data in the data warehouse is integrated.
g. Data availability: When consolidating data from different source systems, it is possible that a piece of data is available in one system but not in the other. For example, system A may have seven address fields (address1, address2, address3, city, county, ZIP, and country), but system B does not have the address3 field and the country field. In system A, an order may have two levels: order header and order line. However, in system B, an order has four levels: order header, order bundle, order line item, and financial components. So when consolidating data across different transaction systems, you need to be aware of unavailable columns and missing levels in the hierarchy. In the previous examples, you can leave address3 blank in the target and set the country to a default value. In the order hierarchy example, you can consolidate into two levels: order header and order line.
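A minimal sketch of items c to f, assuming one invented data quality rule on a country field; the rule, the column names, and the reject/correct/load decision are illustrative assumptions, not a prescribed design:

```python
# Hypothetical sketch: applying a data quality rule to extracted rows and
# deciding whether each row is rejected, corrected, or loaded as-is.
from dataclasses import dataclass

@dataclass
class Row:
    customer_id: int
    country: str

def apply_country_rule(row: Row) -> tuple[str, Row]:
    """Return (action, row), where action is 'load', 'correct', or 'reject'."""
    if row.country in {"UK", "US", "FR"}:
        return "load", row                            # passes the rule
    if row.country == "":                             # warning-level rule
        return "correct", Row(row.customer_id, "UNKNOWN")
    return "reject", row                              # error-level rule

extracted = [Row(1, "UK"), Row(2, ""), Row(3, "XX")]
for r in extracted:
    action, fixed = apply_country_rule(r)
    print(action, fixed)
```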

h. Time ranges: The same piece of data may exist in different systems but cover different time periods. You always need to examine what time period is applicable to which data before you consolidate them. For example, say in system A the average supplier cost is calculated weekly, but in system B it is calculated monthly. You need to go back upstream to get the individual components that make up the average supplier cost in both systems and then add them up first (see the sketch after this list).
i. Definitions: You need to examine the meaning of each piece of data. Just because two fields have the same name doesn't mean they are the same. This is important because you could have inaccurate or meaningless data in the DW if you consolidate data with different meanings.
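A small sketch of the idea in item h: rather than mixing a weekly average from system A with a monthly average from system B, go back to the individual cost records of both systems and aggregate them over one agreed time grain. The column names and the monthly grain are assumptions.

```python
# Hypothetical sketch of item h: aggregate raw cost records from both
# systems to a common monthly period before comparing or combining them.
import pandas as pd

system_a = pd.DataFrame({"date": pd.to_datetime(["2024-01-03", "2024-01-10"]),
                         "cost": [100.0, 150.0]})
system_b = pd.DataFrame({"date": pd.to_datetime(["2024-01-20", "2024-02-05"]),
                         "cost": [200.0, 90.0]})

# Combine the raw records first, then aggregate to one agreed time grain.
combined = pd.concat([system_a, system_b])
monthly = combined.set_index("date").resample("MS")["cost"].sum()
print(monthly)
```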

j. Conversion: When consolidating data across different systems, sometimes you need to do data conversion because the data in the source systems is in different units of measure. If you add the values up without converting them first, you will have incorrect values in the DW. In some cases the conversion rate is fixed (always the same value), but in other cases it changes from time to time. The conversion rate between one currency and another fluctuates every day, so when converting, you need to know when the transaction happened (see the sketch after this list).
k. Matching: Matching is the process of determining whether a piece of data in one system is the same as the data in another system. Sometimes the criteria are simple, such as matching on IDs (for example customer IDs or account IDs). If you match the wrong customers, the transactions of one customer could be mixed up with the data of another customer.
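A minimal sketch of item j, assuming an invented daily rate table and order table; the rates, table names, and columns are illustrative assumptions:

```python
# Hypothetical sketch: convert transaction amounts with the rate that was
# valid on the transaction date before adding them up.
import pandas as pd

rates = pd.DataFrame({"date": pd.to_datetime(["2024-01-01", "2024-01-02"]),
                      "gbp_to_usd": [1.27, 1.26]})
orders = pd.DataFrame({"date": pd.to_datetime(["2024-01-01", "2024-01-02"]),
                       "amount_gbp": [100.0, 250.0]})

# Look up the rate valid on each transaction date, then convert and sum.
converted = orders.merge(rates, on="date", how="left")
converted["amount_usd"] = converted["amount_gbp"] * converted["gbp_to_usd"]
print(converted["amount_usd"].sum())   # correct USD total
```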

Periodically: The data retrieval interval needs to match the source system's data update frequency. If the source system is updated once a day, you need to set the data retrieval to once a day. The data retrieval must also satisfy the business requirements. For example, if the business needs the product profitability report once a week, then the data from the various source systems needs to be consolidated at least once a week. Another example is when a company states to its customers that it will take 24 hours to cancel a marketing subscription. Then the data in the CRM data warehouse needs to be updated a few times a day; otherwise, you risk sending marketing campaigns to customers who have already cancelled their subscriptions.

Dimensional Data Store: A data warehouse is a system that retrieves and consolidates data from one or more source systems into either a dimensional data store or a normalized data store.
a. A dimensional data store is one or several databases containing a collection of dimensional data marts. A dimensional data mart is a group of related fact tables and their corresponding dimension tables, containing the measurements of business events categorized by their dimensions.
b. A dimensional data store is denormalized, and the dimensions are conformed.
c. A dimensional data store can be implemented physically in the form of several different schemas. An example of a dimensional data store schema is the star schema; in a star schema, a dimension does not have a subset or sub-table (see the sketch after this section).

Normalized Data Store: Other types of DWs put the data not in a dimensional data store but in a normalized data store. A normalized data store is one or more relational databases with little or no data redundancy. A relational database is a database that consists of entity tables with parent-child relationships between them. Normalization is the process of removing data redundancy by implementing normalization rules. A dimensional data store is a better format than a normalized data store for storing data warehouse data for the purpose of querying and analyzing it. A normalized data store is a better format for integrating data from various source systems, especially in third normal form and higher, because there is only one place to update, without data redundancy.
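A tiny sketch of the star schema in item c, using pandas DataFrames to stand in for database tables; the table names, keys, and columns are assumptions for illustration, and a real dimensional data store would live in a relational database:

```python
# Hypothetical star schema sketch: one fact table whose foreign keys point
# at denormalized dimension tables (no sub-tables hang off the dimensions).
import pandas as pd

dim_customer = pd.DataFrame({"customer_key": [1, 2],
                             "customer_name": ["Amelia", "Raj"],
                             "city": ["Leeds", "Boston"]})
dim_product = pd.DataFrame({"product_key": [10, 11],
                            "product_type": ["Book", "Music"]})
fact_sales = pd.DataFrame({"customer_key": [1, 2, 1],
                           "product_key": [10, 11, 11],
                           "sales_amount": [12.0, 9.0, 9.0]})

# Analytical queries join the fact table straight to its dimensions.
report = (fact_sales
          .merge(dim_customer, on="customer_key")
          .merge(dim_product, on="product_key")
          .groupby(["city", "product_type"])["sales_amount"].sum())
print(report)
```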
