A data warehouse is a repository of an organization's electronically stored data, designed to facilitate reporting and analysis. It is:

Subject-oriented: The data in the data warehouse is organized so that all the data elements relating to the same real-world event or object are linked together.
Non-volatile: Data in the data warehouse is never over-written or deleted; once committed, the data is static, read-only, and retained for future reporting.
Integrated: The data warehouse contains data from most or all of an organization's operational systems, and this data is made consistent.
Time-variant: Changes to the data are tracked and recorded over time, so that reports can show how the data has changed.

There are two leading approaches to storing data in a data warehouse: the dimensional approach and the normalized approach.

In the dimensional approach, transaction data are partitioned into either "facts", which are generally numeric transaction data, or "dimensions", which are the reference information that gives context to the facts. For example, a sales transaction can be broken up into facts such as the number of products ordered and the price paid for the products, and into dimensions such as order date, customer name, product number, order ship-to and bill-to locations, and the salesperson responsible for the order. A key advantage of the dimensional approach is that the data warehouse is easier for the user to understand and to use.

In the normalized approach, the data in the data warehouse are stored following, to a degree, database normalization rules. Tables are grouped together by subject areas that reflect general data categories (e.g., data on customers, products, finance, etc.). The main advantage of this approach is that it is straightforward to add information to the database.
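The fact/dimension split described above can be sketched in a few lines of Python. This is a minimal illustration with made-up field names, not a real warehouse design:

```python
# One sales transaction, split into a numeric fact and its descriptive
# dimensions (field names here are purely illustrative).
transaction = {
    "order_date": "2024-05-01",
    "customer_name": "Acme Corp",
    "product_number": "P-100",
    "quantity_ordered": 3,
    "price_paid": 29.97,
}

# Facts: the numeric, measurable data of the event.
fact = {k: v for k, v in transaction.items()
        if k in ("quantity_ordered", "price_paid")}

# Dimensions: the reference attributes that give the facts context.
dimensions = {k: v for k, v in transaction.items() if k not in fact}
```

The same record thus yields one fact row plus the dimension values that a user would later filter or group by.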
A disadvantage of this approach is that, because of the number of tables involved, it can be difficult for users to join data from different sources into meaningful information, and to access the information without a precise understanding of the sources of the data and of the data structure of the data warehouse.
Bottom-up design: In the so-called bottom-up approach, data marts are first created to provide reporting and analytical capabilities for specific business processes. These data marts can eventually be integrated to create a comprehensive data warehouse. Top-down design:
A data warehouse is a centralized repository for the entire enterprise. The data warehouse is designed using a normalized enterprise data model. "Atomic" data, that is, data at the lowest level of detail, are stored in the data warehouse. Dimensional data marts containing data needed for specific business processes or specific departments are created from the data warehouse.
Redundant or De-normalized:
Duplication of data: a de-normalized structure holds more data than is strictly needed, and the same data is expressed in more than one place.
A data mart is a subset of an organizational data store, usually oriented to a specific purpose or major data subject, which may be distributed to support business needs. Data marts are analytical data stores designed to focus on specific business functions for a specific community within an organization. Data marts are often derived from subsets of data in a data warehouse, though in the bottom-up data warehouse design methodology the data warehouse is created from the union of organizational data marts.
There are two common data warehouse schema styles:
Star Schema (or Dimensional model)
Snowflake Schema

Star Schema:
The star schema (sometimes referred to as a star join schema) is the simplest style of data warehouse schema. The star schema consists of a few fact tables (possibly only one, justifying the name) referencing any number of dimension tables, and is considered an important special case of the snowflake schema. The facts that data warehouses help analyze are classified along different dimensions: the fact tables hold the main data, while the usually smaller dimension tables describe each value of a dimension and can be joined to the fact tables as needed.
Dimension tables have a simple primary key, while fact tables have a set of foreign keys which together make up a compound primary key consisting of the relevant dimension keys. The main reason for using a star schema is its simplicity from the users' point of view: queries stay simple because the only joins and conditions involve a fact table and a single level of dimension tables, without the indirect dependencies on other tables that are possible in a better-normalized snowflake schema.
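A minimal star schema can be sketched with SQLite. All table and column names below are illustrative; the point is the shape: one fact table whose compound key is made of dimension keys, joined to one level of dimension tables:

```python
import sqlite3

# Build a tiny star schema: one fact table, two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity    INTEGER,
    amount      REAL,
    PRIMARY KEY (date_key, product_key)  -- compound key of dimension keys
);
INSERT INTO dim_date    VALUES (1, '2024-05-01'), (2, '2024-05-02');
INSERT INTO dim_product VALUES (10, 'Widget'), (11, 'Gadget');
INSERT INTO fact_sales  VALUES (1, 10, 3, 30.0), (2, 10, 5, 50.0), (2, 11, 1, 99.0);
""")

# A typical star-schema query: join the fact table to a single level of
# dimension tables and aggregate the measures.
rows = conn.execute("""
    SELECT p.product_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.product_name
    ORDER BY p.product_name
""").fetchall()
# rows -> [('Gadget', 99.0), ('Widget', 80.0)]
```

Note that the query touches only the fact table and one dimension table; no chain of intermediate joins is needed.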
A snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity relationship diagram resembles a snowflake in shape. Closely related to the star schema, the snowflake schema is represented by centralized fact tables connected to multiple dimensions. In the snowflake schema, however, dimensions are normalized into multiple related tables, whereas the star schema's dimensions are denormalized, with each dimension represented by a single table. When the dimensions of a snowflake schema are elaborate, having multiple levels of relationships, and where child tables have multiple parent tables ("forks in the road"), a complex snowflake shape starts to emerge. The "snowflaking" effect only affects the dimension tables, not the fact tables.
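Continuing the SQLite sketch (names again illustrative), snowflaking means normalizing a dimension into related tables, so queries need an extra join hop:

```python
import sqlite3

# Snowflaked product dimension: the product table now references a
# separate, normalized category table instead of carrying the category
# name itself.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, product_name TEXT,
                           category_key INTEGER REFERENCES dim_category(category_key));
CREATE TABLE fact_sales   (product_key INTEGER REFERENCES dim_product(product_key),
                           amount REAL);
INSERT INTO dim_category VALUES (1, 'Hardware');
INSERT INTO dim_product  VALUES (10, 'Widget', 1), (11, 'Gadget', 1);
INSERT INTO fact_sales   VALUES (10, 80.0), (11, 99.0);
""")

# The same rollup now requires joining through two dimension levels.
total = conn.execute("""
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product  p ON p.product_key  = f.product_key
    JOIN dim_category c ON c.category_key = p.category_key
    GROUP BY c.category_name
""").fetchone()
# total -> ('Hardware', 179.0)
```

The category name is stored once rather than repeated per product, which is exactly the normalization-versus-query-simplicity trade-off described above.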
Reasons for creating a Data mart:
Easy access to frequently needed data
Creates a collective view for a group of users
Improves end-user response time
Ease of creation
Lower cost than implementing a full data warehouse
Potential users are more clearly defined than in a full data warehouse
Extract, Transform, Load (ETL):
Extract, transform, and load (ETL) is a process in database usage and especially in data warehousing that involves:
Extracting data from outside sources
Transforming it to fit operational needs (which can include quality levels)
Loading it into the end target (database or data warehouse)
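The three steps above can be sketched end to end as a toy pipeline. Everything here (the source rows, the column names, the target table) is made up for illustration:

```python
import sqlite3

def extract():
    # In practice this would read from files, databases, or APIs;
    # here we return flat-file-style text rows.
    return [("1", "100.50"), ("2", "200.25")]

def transform(rows):
    # Cast the text fields to the types the target schema expects.
    return [(int(cust_id), float(amount)) for cust_id, amount in rows]

def load(conn, rows):
    # Load the transformed rows into the warehouse target table.
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
# total -> 300.75
```

Real ETL tools add scheduling, logging, and error handling around this same extract-transform-load skeleton.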
The first part of an ETL process involves extracting the data from the source systems. Most data warehousing projects consolidate data from different source systems, and each separate system may use a different data organization or format. Common data source formats are relational databases and flat files, but sources may also include non-relational database structures such as Information Management System (IMS), other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even outside sources reached through web spidering or screen-scraping. Extraction converts the data into a format suitable for transformation processing. An intrinsic part of extraction is parsing the extracted data to check whether it meets an expected pattern or structure; if not, the data may be rejected entirely or in part.
The transform stage applies a series of rules or functions to the extracted data from the source to derive the data for loading into the end target. Some data sources will require very little or even no manipulation of data. In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the target database:
Selecting only certain columns to load (or selecting null columns not to load). For example, if source data has three columns (also called attributes) say roll_no, age and salary then the extraction may take only roll_no and salary.
Similarly, the extraction mechanism may ignore all those records where salary is not present (salary = null).
Other transformation types include:
Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the warehouse stores M for male and F for female); this calls for automated data cleansing, as no manual cleansing occurs during ETL
Encoding free-form values (e.g., mapping "Male" to "1" and "Mr" to "M")
Deriving a new calculated value (e.g., sale_amount = qty * unit_price)
Filtering
Sorting
Joining data from multiple sources (e.g., lookup, merge)
Aggregation (e.g., rollup: summarizing multiple rows of data, such as total sales for each store and for each region)
Generating surrogate-key values
Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
Splitting a column into multiple columns (e.g., turning a comma-separated list stored as a string in one column into individual values in separate columns)
Disaggregating repeating columns into a separate detail table (e.g., moving a series of addresses in one record into single addresses in a set of records in a linked address table)
Looking up and validating the relevant data from tables or referential files for slowly changing dimensions
Applying any form of simple or complex data validation; if validation fails, it may result in a full, partial, or no rejection of the data, so that none, some, or all of the data are handed over to the next step, depending on the rule design and exception handling

Many of the above transformations may result in exceptions, for example when a code translation parses an unknown code in the extracted data.
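A few of the listed transformation types can be demonstrated together on made-up source records (all field names and values are illustrative):

```python
# Made-up source records with the roll_no/salary columns from the text.
source = [
    {"roll_no": 1, "age": 30, "salary": 1000, "gender": "1",
     "qty": 2, "unit_price": 5.0},
    {"roll_no": 2, "age": 40, "salary": None, "gender": "2",
     "qty": 1, "unit_price": 9.0},
]

gender_codes = {"1": "M", "2": "F"}  # source code -> warehouse code

transformed = [
    {
        "roll_no": rec["roll_no"],                     # column selection
        "salary": rec["salary"],
        "gender": gender_codes[rec["gender"]],         # coded-value translation
        "sale_amount": rec["qty"] * rec["unit_price"], # derived value
    }
    for rec in source
    if rec["salary"] is not None                       # filter null salaries
]
# transformed -> [{'roll_no': 1, 'salary': 1000, 'gender': 'M', 'sale_amount': 10.0}]
```

Column selection, code translation, derivation, and filtering are each a single expression here; production ETL applies the same operations at scale with exception handling around each rule.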
The load phase loads the data into the end target, usually the data warehouse (DW). Depending on the requirements of the organization, this process varies widely. Some data warehouses may overwrite existing information with cumulative information, with updates of the extracted data performed daily, weekly, or monthly; other data warehouses (or even other parts of the same DW) may instead add new data in a historized form.
Dimensions in Data warehouse:
In a data warehouse, a dimension is a data element that categorizes each item in a data set into non-overlapping regions. A data warehouse dimension provides the means to "slice and dice" data in a data warehouse. Dimensions provide structured labeling information to otherwise unordered numeric measures. For example, "Customer", "Date", and "Product" are all dimensions that could be applied meaningfully to a sales receipt. A dimensional data element is similar to a categorical variable in statistics.
The primary function of dimensions is threefold: to provide filtering, grouping, and labeling. For example, in a data warehouse where each person is categorized as having a gender of male, female, or unknown, a user of the data warehouse can filter each presentation or report based on the gender dimension, or display results broken out by gender.
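The filtering and grouping roles of a dimension can be shown on made-up fact rows keyed by the gender dimension from the example:

```python
# Made-up fact rows, each labeled with a value of the gender dimension.
facts = [
    {"gender": "male", "amount": 10.0},
    {"gender": "female", "amount": 25.0},
    {"gender": "female", "amount": 15.0},
    {"gender": "unknown", "amount": 5.0},
]

# Filtering: restrict the report to a single dimension value.
female_only = [f for f in facts if f["gender"] == "female"]

# Grouping and labeling: break totals out by dimension value.
totals = {}
for f in facts:
    totals[f["gender"]] = totals.get(f["gender"], 0.0) + f["amount"]
# totals -> {'male': 10.0, 'female': 40.0, 'unknown': 5.0}
```

The dimension value is what makes the otherwise unordered numeric measures filterable and groupable.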
Types of Dimensions:
Conformed Dimension
Junk Dimension
Degenerate Dimension
Role-Playing Dimension
Conformed Dimension: Dimensions are conformed when they are either exactly the same (including keys) or one is a perfect subset of the other. Most important, the row headers produced in the answer sets from two different conformed dimensions must be able to match perfectly. Conformed dimensions are either identical or strict mathematical subsets of the most granular, detailed dimension. Dimension tables are not conformed if the attributes are labeled differently or contain different values. Conformed dimensions come in several different flavors; at the most basic level, a conformed dimension means exactly the same thing with every possible fact table to which it is joined. For example, the date dimension table connected to the sales facts is identical to the date dimension connected to the inventory facts.

Junk Dimension: A junk dimension is a convenient grouping of typically low-cardinality flags and indicators. By creating an abstract dimension, these flags and indicators are removed from the fact table and placed into a useful dimensional framework.

Degenerate Dimension: A dimension key, such as a transaction number, invoice number, ticket number, or bill-of-lading number, that has no attributes and hence does not join to an actual dimension table. Degenerate dimensions are very common when the grain of a fact table represents a single transaction or line item, because the degenerate dimension represents the unique identifier of the parent. Degenerate dimensions often play an integral role in the fact table's primary key.

Role-Playing Dimension:
Dimensions are often recycled for multiple applications within the same database. For instance, a "Date" dimension can be used for "Date of Sale", as well as "Date of Delivery", or "Date of Hire". This is often referred to as a "role-playing dimension".
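A role-playing Date dimension is simply the same table joined under different aliases, one per role. A minimal SQLite sketch (schema and names are illustrative):

```python
import sqlite3

# One Date dimension, referenced twice by the fact table: once as the
# sale date and once as the delivery date.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT);
CREATE TABLE fact_order (
    sale_date_key     INTEGER REFERENCES dim_date(date_key),
    delivery_date_key INTEGER REFERENCES dim_date(date_key),
    amount REAL
);
INSERT INTO dim_date   VALUES (1, '2024-05-01'), (2, '2024-05-04');
INSERT INTO fact_order VALUES (1, 2, 50.0);
""")

# Each "role" is the same dimension table under a different alias.
row = conn.execute("""
    SELECT sale.full_date, delivery.full_date, o.amount
    FROM fact_order o
    JOIN dim_date sale     ON sale.date_key     = o.sale_date_key
    JOIN dim_date delivery ON delivery.date_key = o.delivery_date_key
""").fetchone()
# row -> ('2024-05-01', '2024-05-04', 50.0)
```

Only one physical Date table exists; the roles live entirely in the fact table's foreign keys and the query aliases.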
Slowly Changing Dimension (SCD):
Slowly Changing Dimensions (SCD) are dimensions whose data changes slowly over time. For example, you may have a dimension in your database that tracks the sales records of your company's salespeople. Creating sales reports seems simple enough, until a salesperson is transferred from one regional office to another. How do you record such a change in your sales dimension? You could create a second salesperson record and treat the transferred person as a new salesperson, but that creates other problems. Dealing with these issues involves SCD management methodologies referred to as Type 0 through Type 6. Type 6 SCDs are also sometimes called Hybrid SCDs.
Type 0: The Type 0 method is a passive approach to managing dimension value changes, in which no action is taken. Values remain as they were at the time the dimension record was first entered. In certain circumstances history is preserved with a Type 0 SCD, but higher-order SCD types are generally employed to guarantee history preservation; Type 0 provides the least control over managing a slowly changing dimension.

Type 1: The Type 1 methodology overwrites old data with new data, and therefore does not track historical data at all. The obvious disadvantage of this method is that no historical record is kept in the data warehouse; the advantage is that Type 1 dimensions are very easy to maintain.

Type 2: The Type 2 method tracks historical data by creating multiple records in the dimension tables with separate surrogate keys. With Type 2 we have unlimited history preservation, as a new record is inserted each time a change is made. For example, in a table that keeps supplier information, if a supplier moves to Illinois, a new record with a new surrogate key is inserted for the Illinois address while the original record is retained unchanged.
Another popular method for tuple versioning is to add effective date columns.
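A minimal Python sketch of Type 2 with effective-date columns follows; the supplier name, keys, and dates are all illustrative. A change closes the current row and inserts a new row under a new surrogate key, so history is preserved:

```python
# Dimension rows carry start/end dates; end_date of None marks the
# current version of the row (all values here are made up).
supplier_dim = [
    {"supplier_key": 123, "supplier_name": "ABC Supply", "state": "CA",
     "start_date": "2000-01-01", "end_date": None},
]

def apply_type2_change(dim, supplier_name, new_state, change_date, next_key):
    """Close the supplier's current row and insert a new current row."""
    for row in dim:
        if row["supplier_name"] == supplier_name and row["end_date"] is None:
            row["end_date"] = change_date  # close the old version
    dim.append({"supplier_key": next_key, "supplier_name": supplier_name,
                "state": new_state, "start_date": change_date, "end_date": None})

# The supplier moves to Illinois: one closed row, one new current row.
apply_type2_change(supplier_dim, "ABC Supply", "IL", "2004-12-22", 124)
```

Facts recorded before the change keep pointing at the old surrogate key, which is how Type 2 preserves the historical context of each transaction.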
Type 3: The Type 3 method tracks changes using separate columns. Whereas Type 2 has unlimited history preservation, Type 3 has limited history preservation, since it is limited to the number of columns designated for storing historical data. Where the table structure in Type 1 and Type 2 is very similar, Type 3 adds extra columns to the dimension table to hold prior values (for example, a column for the current state alongside one for the original state).
Note that this approach cannot track all historical changes, such as when a supplier moves twice.

Type 4: The Type 4 method is usually just referred to as using "history tables": one table keeps the current data, and an additional table is used to keep a record of some or all changes. Following the example above, the original table might be called Supplier and the history table might be called Supplier_History.
Type 6/Hybrid: The Type 6 method combines the approaches of Types 1, 2, and 3 (1 + 2 + 3 = 6). The approach is to use a Type 1 slowly changing dimension, but to add a pair of date columns indicating the date range over which a particular row in the dimension applies, plus a flag indicating whether the record is the current record. This approach has a number of advantages:
The user can query using the current values of the dimension table by restricting the rows with a filter that selects only current values.
Alternatively, the user can obtain the "as at the time of the transaction" values by using one of the date fields on the transaction as a constraint on the dimension table.
If there are several date columns on the transaction (e.g., Order Date, Shipping Date, Confirmation Date), the user can choose which date to analyze the fact data by, something not possible using the other approaches.

In a Type 6 Supplier table, each row would therefore carry both the current and the historical attribute values, start and end dates for the row's validity, and a current-record flag.
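A minimal sketch of such a Type 6 table and the two query styles, with illustrative names, keys, and dates, can be written as plain Python rows:

```python
# Type 6 rows for a supplier that moved from CA to IL: every row carries
# the current value (Type 1), its own historical value with a date range
# (Type 2/3), and a current-row flag. All values here are made up.
supplier_dim = [
    {"supplier_key": 123, "supplier_name": "ABC Supply",
     "current_state": "IL", "historical_state": "CA",
     "start_date": "2000-01-01", "end_date": "2004-12-21", "current_flag": "N"},
    {"supplier_key": 124, "supplier_name": "ABC Supply",
     "current_state": "IL", "historical_state": "IL",
     "start_date": "2004-12-22", "end_date": "9999-12-31", "current_flag": "Y"},
]

# Query style 1: current values, by restricting to the flagged row.
current = [r for r in supplier_dim if r["current_flag"] == "Y"]

# Query style 2: "as at the time of the transaction", by constraining
# the dimension with a transaction date (ISO strings compare correctly).
txn_date = "2003-06-15"
as_at = [r for r in supplier_dim
         if r["start_date"] <= txn_date <= r["end_date"]]
```

A 2003 transaction resolves to the CA row for its historical value while still exposing IL as the current value, which is exactly the hybrid behavior Type 6 is built for.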