A data warehouse is a type of data management system that is designed to enable and support business
intelligence (BI) activities, especially analytics. Data warehouses are solely intended to perform queries and
analysis and often contain large amounts of historical data. The data within a data warehouse is usually
derived from a wide range of sources such as application log files and transaction applications.
A data warehouse centralizes and consolidates large amounts of data from multiple sources. Its analytical
capabilities allow organizations to derive valuable business insights from their data to improve decision-
making. Over time, it builds a historical record that can be invaluable to data scientists and business analysts.
Because of these capabilities, a data warehouse can be considered an organization’s “single source of truth.”
Four unique characteristics (described by computer scientist William Inmon, who is considered the father of
the data warehouse) allow data warehouses to deliver this overarching benefit. According to this definition,
data warehouses are
Subject-oriented. They can analyze data about a particular subject or functional area (such as sales).
Integrated. Data warehouses create consistency among different data types from disparate sources.
Nonvolatile. Once data is in a data warehouse, it’s stable and doesn’t change.
Time-variant. Data warehouse analysis looks at change over time.
A well-designed data warehouse will perform queries very quickly, deliver high data throughput, and provide enough
flexibility for end users to “slice and dice” or reduce the volume of data for closer examination to meet a variety of
demands—whether at a high level or at a very fine, detailed level. The data warehouse serves as the functional
foundation for middleware BI environments that provide end users with reports, dashboards, and other interfaces.
The best cloud data warehouses are fully managed and self-driving, so even beginners can create and use a
data warehouse with only a few clicks. An easy way to start your migration to a cloud data warehouse is to run it
on-premises, behind your data center firewall, which helps satisfy data sovereignty and
security requirements.
In addition, most cloud data warehouses follow a pay-as-you-go model, which brings added cost savings to customers.
Where is it used?
It is used for evaluating future strategy.
The ultimate use of data warehouses is mass customization.
For example, mass customization increased Capital One’s customer base from 1 million to approximately 9 million in 8 years.
BILL INMON
The father of the data warehouse and co-creator of the Corporate Information Factory.
He has 35 years of experience in database technology management and data warehouse design.
Bill has written about a variety of topics on the building, usage, and maintenance of the warehouse and the
Corporate Information Factory.
He has written more than 650 articles (in Datamation, ComputerWorld, and Byte magazine).
Data warehouse designs have evolved over time to deliver incremental value to the enterprise, culminating in
the enterprise data warehouse (EDW).
Operational System. An operational system is the term used in data warehousing for a system that processes the
day-to-day transactions of an organization.
Flat Files. A flat file system stores transactional data in plain files, and every file in the system
must have a different name.
Metadata. A set of data that defines and gives information about other data.
Metadata is used in a data warehouse for a variety of purposes, including:
-summarizing necessary information about the data, which makes finding and working with particular instances of data
easier;
-directing a query to the most appropriate data source.
Lightly and Highly Summarized Data. This area of the data warehouse stores all the predefined lightly and highly
summarized (aggregated) data generated by the warehouse manager.
The goal of the summarized data is to speed up query performance. The summarized records are updated
continuously as new data is loaded into the warehouse.
End-User Access Tools. The principal purpose of a data warehouse is to provide information to business managers
for strategic decision-making. These users interact with the warehouse through end-user access tools.
Data Warehouse Staging Area. A temporary location where records from source systems are copied.
We may want to customize our warehouse's architecture for multiple groups within our organization.
We can do this by adding data marts. A data mart is a segment of a data warehouse that provides information for
reporting and analysis on a section, unit, department, or operation in the company, e.g., sales, payroll, production, etc.
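As a minimal sketch of the idea, a departmental data mart can be exposed as a restricted view over a warehouse table. This uses Python's sqlite3 as a stand-in; all table and column names are illustrative, not from the text:

```python
import sqlite3

# In-memory stand-in for a warehouse; names and figures are invented.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY, department TEXT, amount REAL)""")
con.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(1, "sales", 100.0), (2, "payroll", 250.0), (3, "sales", 75.0)])

# A data mart for the sales department: a restricted slice of the warehouse.
con.execute("""CREATE VIEW mart_sales AS
               SELECT sale_id, amount FROM fact_sales
               WHERE department = 'sales'""")

rows = con.execute("SELECT COUNT(*), SUM(amount) FROM mart_sales").fetchone()
print(rows)  # (2, 175.0)
```

In practice a dependent data mart would be a separate, denormalized store fed from the warehouse rather than a view, but the principle is the same: a subject-focused subset of the centralized data.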
SINGLE-TIER ARCHITECTURE is rarely used in practice. Its purpose is to minimize the amount of
data stored; to reach this goal, it removes data redundancies. The weakness of this architecture lies in its failure to
meet the requirement for separation between analytical and transactional processing. Analysis queries are applied to
operational data after the middleware interprets them. In this way, queries affect transactional workloads.
TWO-TIER ARCHITECTURE
Data flows through four successive stages:
• Source layer
• Data Staging
• Data Warehouse layer
• Analysis
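The four stages above can be sketched as a toy pipeline; the functions, record layouts, and figures are hypothetical:

```python
# Toy two-tier flow: source -> staging -> warehouse -> analysis.
source_rows = [{"id": 1, "amt": "10"},   # source layer: raw, string-typed,
               {"id": 2, "amt": " 5"},   # possibly duplicated records
               {"id": 1, "amt": "10"}]

def stage(rows):
    # Data staging: cleanse types and drop duplicate records.
    seen, out = set(), []
    for r in rows:
        if r["id"] not in seen:
            seen.add(r["id"])
            out.append({"id": r["id"], "amt": float(r["amt"])})
    return out

warehouse = stage(source_rows)            # data warehouse layer
total = sum(r["amt"] for r in warehouse)  # analysis
print(total)  # 15.0
```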
THREE-TIER ARCHITECTURE
The three-tier architecture consists of the source layer (containing multiple source systems), the reconciled layer, and
the data warehouse layer (containing both data warehouses and data marts). The reconciled layer sits between the
source data and the data warehouse.
The main advantage of the reconciled layer is that it creates a standard reference data model for a whole enterprise.
At the same time, it separates the problems of source data extraction and integration from those of data warehouse
population. In some cases, the reconciled layer is also used directly to better accomplish some operational tasks, such
as producing daily reports that cannot be satisfactorily prepared using the corporate applications, or generating data
flows that feed external processes periodically so as to benefit from cleaning and integration.
BOTTOM-TIER
A bottom-tier that consists of the Data Warehouse server, which is almost always an RDBMS. It may include several
specialized data marts and a metadata repository.
Data from operational databases and external sources (such as user profile data provided by external consultants) are
extracted using application program interfaces called a gateway. A gateway is provided by the underlying DBMS and
allows client programs to generate SQL code to be executed at a server.
Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding for
Databases) by Microsoft, and JDBC (Java Database Connectivity).
MIDDLE-TIER
A middle tier that consists of an OLAP server for fast querying of the data warehouse.
The OLAP server is implemented using either:
(1) A relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps operations on
multidimensional data to standard relational operations; or
(2) A multidimensional OLAP (MOLAP) model, i.e., a special-purpose server that directly implements
multidimensional data and operations.
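To illustrate the ROLAP idea, an operation on multidimensional data (here, a roll-up from city to region) maps to a relational GROUP BY. A minimal sqlite3 sketch with an invented schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, city TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [("East", "Boston", 10.0), ("East", "NYC", 20.0),
                 ("West", "LA", 5.0)])

# Roll-up from city level to region level = GROUP BY on the coarser attribute.
rollup = con.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rollup)  # [('East', 30.0), ('West', 5.0)]
```

A MOLAP server would instead store the cube cells directly (e.g., as a multidimensional array) and compute the same roll-up without generating SQL.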
TOP-TIER
A top-tier that contains front-end tools for displaying results provided by OLAP, as well as additional tools for data
mining of the OLAP-generated data.
The metadata repository stores information that defines DW objects. It includes the following parameters and
information for the middle and the top-tier applications:
• A description of the DW structure, including the warehouse schema, dimensions, hierarchies, data mart
locations and contents, etc.
• Operational metadata, which usually describes the currency level of the stored data, i.e., active, archived or
purged, and warehouse monitoring information, i.e., usage statistics, error reports, audit, etc.
• System performance data, which includes indices, used to improve data access and retrieval performance.
• Information about the mapping from operational databases, which provides source RDBMSs and their
contents, cleaning and transformation rules, etc.
• Summarization algorithms, predefined queries and reports, and business metadata, which includes business terms and
definitions, ownership information, etc.
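One way to picture such a repository is as a small set of keyed entries; the names and fields below are purely illustrative assumptions, not a real repository format:

```python
# Hypothetical metadata repository entries; every name here is invented.
metadata_repository = {
    # DW structure: schema, dimensions, data mart locations.
    "structure": {"schema": "star",
                  "fact_tables": ["fact_sales"],
                  "dimensions": ["dim_time", "dim_product"]},
    # Operational metadata: currency level and monitoring info.
    "operational": {"fact_sales": {"currency": "active",
                                   "rows_loaded": 1_200_000}},
    # Mapping from operational databases: source columns and transform rules.
    "mapping": {"fact_sales.amount": {"source": "orders.total",
                                      "rule": "CAST to REAL, NULL -> 0"}},
}

# A query tool could consult the mapping to trace a figure back to its source.
lineage = metadata_repository["mapping"]["fact_sales.amount"]["source"]
print(lineage)  # orders.total
```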
A data warehouse, also known as an enterprise data warehouse (EDW), is a system that collects data from various
sources and stores it in a single, centralized, consistent location to facilitate data analysis, data mining, artificial
intelligence (AI), and machine learning. As the business evolves, so do its requirements, and a data warehouse must be
developed to keep up with these changes. As a result, a data warehouse system must be versatile.
Ideally there should be a delivery process to deliver a data warehouse. However, data warehouse projects normally
suffer from various issues that make it difficult to complete tasks and deliverables in the strict and ordered fashion
demanded by the waterfall method. Most of the time, the requirements are not understood completely. The
architectures, designs, and build components can be completed only after gathering and studying all the requirements.
DELIVERY METHOD
The delivery method is a variant of the joint application development approach adopted for the delivery of a data
warehouse. We have staged the data warehouse delivery process to minimize risks. The approach that we will discuss
here does not reduce the overall delivery time-scales but ensures the business benefits are delivered incrementally
through the development process.
Note − The delivery process is broken into phases to reduce the project and delivery risk.
IT Strategy
Data warehouses are strategic investments that require a business process to generate benefits. An IT strategy is
required to procure and retain funding for the project.
Business Case
The objective of the business case is to estimate the business benefits that should be derived from using a data
warehouse. These benefits may not be quantifiable, but the projected benefits need to be clearly stated. If a data
warehouse does not have a clear business case, the business tends to suffer from credibility problems at some stage
during the delivery process. Therefore, in data warehouse projects, we need to understand the business case for investment.
The following points are to be kept in mind to produce an early release and deliver business benefits.
• Identify the architecture that is capable of evolving.
• Focus on business requirements and technical blueprint phases.
• Limit the scope of the first build phase to the minimum that delivers business benefits.
• Understand the short-term and medium-term requirements of the data warehouse.
Business Requirements
To provide quality deliverables, we should make sure the overall requirements are understood. If we understand the
business requirements for both short-term and medium-term, then we can design a solution to fulfil short-term
requirements. The short-term solution can then be grown to a full solution.
The following aspects are determined in this stage −
• The business rule to be applied on data.
• The logical model for information within the data warehouse.
• The query profiles for the immediate requirement.
• The source systems that provide this data.
Technical Blueprint
This phase needs to deliver an overall architecture satisfying the long-term requirements. It also delivers the
components that must be implemented in the short term to derive any business benefit. The blueprint needs to identify
the following.
• The overall system architecture.
• The data retention policy.
• The backup and recovery strategy.
• The server and data mart architecture.
• The capacity plan for hardware and infrastructure.
• The components of database design.
Building the Version. In this stage, the first production deliverable is produced. This production deliverable is the
smallest component of a data warehouse. This smallest component adds business benefit.
History Load. This is the phase where the remainder of the required history is loaded into the data warehouse. In this
phase, we do not add new entities, but additional physical tables would probably be created to store increased data
volumes.
Ad hoc Query. In this phase, we configure an ad hoc query tool that is used to operate on the data warehouse. These
tools can generate the database query.
Note − It is recommended not to use these access tools while the database is being substantially modified.
Automation
In this phase, operational management processes are fully automated. These would include −
• Transforming the data into a form suitable for analysis.
• Monitoring query profiles and determining appropriate aggregations to maintain system performance.
• Extracting and loading data from different source systems.
• Generating aggregations from predefined definitions within the data warehouse.
• Backing up, restoring, and archiving the data.
Extending Scope. In this phase, the data warehouse is extended to address a new set of business requirements. The
scope can be extended in two ways −
• By loading additional data into the data warehouse.
• By introducing new data marts using the existing information.
Note − This phase should be performed separately, since it involves substantial efforts and complexity.
Requirements Evolution
From the perspective of the delivery process, the requirements are always changeable; they are not static. The
delivery process must support this and allow these changes to be reflected within the system.
This issue is addressed by designing the data warehouse around the use of data within business processes, as
opposed to the data requirements of existing queries.
The architecture is designed to change and grow to match the business needs. The process operates as a
pseudo-application development process, where new requirements are continually fed into the
development activities and partial deliverables are produced. These partial deliverables are fed back to the
users and then reworked, ensuring that the overall system is continually updated to meet the business needs.
SYSTEM PROCESSES
We have a fixed number of operations to be applied to operational databases, and we have well-defined
techniques for them, such as using normalized data and keeping tables small. These techniques are suitable for
delivering a solution. But in the case of decision-support systems, we do not know what queries and operations
will need to be executed in the future. Therefore, the techniques applied to operational databases are not suitable
for data warehouses.
Extract and Load Process. Data extraction takes data from the source systems. Data load takes the extracted data and
loads it into the data warehouse.
Note − Before loading the data into the data warehouse, the information extracted from the external sources must be
reconstructed.
Controlling the Process. Controlling the process involves determining when to start data extraction and running
consistency checks on the data. The controlling process ensures that the tools, the logic modules, and the programs are
executed in the correct sequence and at the correct time.
When to Initiate Extract. Data needs to be in a consistent state when it is extracted, i.e., the data warehouse should
represent a single, consistent version of the information to the user.
Loading the Data. After extracting the data, it is loaded into a temporary data store where it is cleaned up and made
consistent.
Note − Consistency checks are executed only when all the data sources have been loaded into the temporary data
store.
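The rule in the note — run consistency checks only once every source has landed in the temporary store — can be sketched as follows (the source names are hypothetical):

```python
# Hypothetical staging area keyed by source system name.
EXPECTED_SOURCES = {"crm", "billing", "web_logs"}
staging = {}

def land(source, rows):
    # A source system's extract arrives in the temporary data store.
    staging[source] = rows

def ready_for_checks():
    # Consistency checks run only when all expected sources have loaded.
    return EXPECTED_SOURCES.issubset(staging)

land("crm", [1, 2])
land("billing", [3])
assert not ready_for_checks()  # web_logs still missing: no checks yet
land("web_logs", [4])
assert ready_for_checks()      # all sources present: checks may run
```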
Clean and Transform Process. Once the data is extracted and loaded into the temporary data store, it is time to
perform Cleaning and Transforming. Here is the list of steps involved in Cleaning and Transforming −
• Clean and transform the loaded data into a structure
• Partition the data
• Aggregation
Clean and Transform the Loaded Data into a Structure. Cleaning and transforming the loaded data helps speed up
the queries. This is done by making the data consistent −
• within itself.
• with other data within the same data source.
• with the data in other source systems.
• with the existing data present in the warehouse.
Transforming involves converting the source data into a structure suitable for analysis. Structuring the data increases
query performance and decreases the operational cost. The data contained in a data warehouse must be transformed to
support performance requirements and control the ongoing operational costs.
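As a small illustration of making data consistent across sources, the sketch below normalizes country codes and numeric types; the mapping table and field names are invented:

```python
# Illustrative cleaning step: normalize codes so records from two
# hypothetical sources agree on the same representation.
COUNTRY_MAP = {"US": "USA", "U.S.A.": "USA", "United States": "USA"}

def clean(record):
    r = dict(record)
    r["country"] = COUNTRY_MAP.get(r["country"].strip(), r["country"].strip())
    r["amount"] = float(r["amount"])  # one numeric type across sources
    return r

rows = [{"country": "US", "amount": "10.5"},        # source A: string amounts
        {"country": "United States", "amount": 5}]  # source B: mixed codes
cleaned = [clean(r) for r in rows]
print(cleaned[0]["country"], cleaned[1]["country"])  # USA USA
```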
Partition the Data. Partitioning optimizes hardware performance and simplifies the management of the data
warehouse. Here we partition each fact table into multiple separate partitions.
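A minimal sketch of range partitioning a fact table by month; the partition key and rows are illustrative:

```python
from collections import defaultdict

# Illustrative fact rows: (date, amount).
facts = [("2024-01-15", 10.0), ("2024-01-20", 5.0), ("2024-02-01", 7.0)]

# Range partitioning: partition key = YYYY-MM prefix of the date.
partitions = defaultdict(list)
for date, amount in facts:
    partitions[date[:7]].append((date, amount))

# A query for January now touches one partition, not the whole table.
print(sorted(partitions))  # ['2024-01', '2024-02']
```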
Aggregation. Aggregation is required to speed up common queries. Aggregation relies on the fact that most common
queries will analyze a subset or an aggregation of the detailed data.
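As an illustration, a precomputed summary table can serve common queries instead of the detail rows (sqlite3; the schema is invented for the example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_sales (day TEXT, amount REAL)")
con.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                [("2024-01-01", 10.0), ("2024-01-01", 5.0),
                 ("2024-01-02", 7.0)])

# Precomputed daily aggregation: common queries read this summary
# instead of scanning the detailed fact rows.
con.execute("""CREATE TABLE agg_daily AS
               SELECT day, SUM(amount) AS total
               FROM fact_sales GROUP BY day""")

total = con.execute(
    "SELECT total FROM agg_daily WHERE day = '2024-01-01'").fetchone()[0]
print(total)  # 15.0
```

In a real warehouse the warehouse manager would refresh such summaries as new data is loaded, as described earlier for lightly and highly summarized data.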
Backup and Archive the Data. In order to recover the data in the event of data loss, software failure, or hardware
failure, it is necessary to keep regular backups. Archiving involves removing old data from the system in a format
that allows it to be quickly restored whenever required.
The information generated in this process is used by the warehouse management process to determine which
aggregations to generate. This process does not generally operate during the regular load of information into data
warehouse.
QUERY-DRIVEN APPROACH
This is the traditional approach to integrating heterogeneous databases. It builds wrappers and
integrators on top of multiple heterogeneous databases. These integrators are also known as mediators.
DISADVANTAGES
• The query-driven approach requires complex integration and filtering processes.
• This approach is very inefficient.
• It is very expensive for frequent queries.
• This approach is also very expensive for queries that require aggregations.
UPDATE-DRIVEN APPROACH
This is an alternative to the traditional approach. Today's data warehouse systems follow the update-driven approach
rather than the traditional approach discussed earlier. In the update-driven approach, information from multiple
heterogeneous sources is integrated in advance and stored in a warehouse. This information is available for direct
querying and analysis.
ADVANTAGES
This approach has the following advantages −
• This approach provides high performance.
• The data is copied, processed, integrated, annotated, summarized, and restructured in a semantic data store in
advance.
• Query processing does not require an interface to process data at local sources.
RELATED SYSTEMS
DATA MART
A data mart is a simple form of a data warehouse focused on a single subject (or functional area),
and hence draws data from a limited number of sources, such as sales, finance, or marketing. Data marts are
often built and controlled by a single department within an organization. The sources could be internal
operational systems, a central data warehouse, or external data. Denormalization is the norm for data
modeling techniques in this system. Given that data marts generally cover only a subset of the data contained
in a data warehouse, they are often easier and faster to implement. Types of data marts include dependent,
independent, and hybrid data marts.
Online analytical processing (OLAP) is characterized by a relatively low volume of transactions. Queries are often
very complex and involve aggregations. For OLAP systems, response time is an effective measure. OLAP
applications are widely used in data mining. OLAP databases store aggregated, historical data in multi-
dimensional schemas (usually star schemas). OLAP systems typically have a data latency of a few hours, as opposed
to data marts, where latency is expected to be closer to one day. The OLAP approach is used to analyze
multidimensional data from multiple sources and perspectives. The three basic operations in OLAP are roll-up
(consolidation), drill-down, and slicing & dicing.
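The three operations can be illustrated against a tiny cube (sqlite3; the schema and figures are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cube (year INT, quarter TEXT, region TEXT, amount REAL)")
con.executemany("INSERT INTO cube VALUES (?, ?, ?, ?)", [
    (2023, "Q1", "East", 10.0), (2023, "Q2", "East", 20.0),
    (2023, "Q1", "West", 5.0)])

# Roll-up (consolidation): aggregate quarters up to a yearly total.
rollup = con.execute(
    "SELECT year, SUM(amount) FROM cube GROUP BY year").fetchall()
# Drill-down: go back to the finer quarter level.
drill = con.execute(
    "SELECT quarter, SUM(amount) FROM cube GROUP BY quarter ORDER BY quarter"
).fetchall()
# Slice: fix one dimension (region = 'East') and look at the rest.
slice_east = con.execute(
    "SELECT SUM(amount) FROM cube WHERE region = 'East'").fetchone()

print(rollup)      # [(2023, 35.0)]
print(drill)       # [('Q1', 15.0), ('Q2', 20.0)]
print(slice_east)  # (30.0,)
```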
Online transaction processing (OLTP) is characterized by a large number of short online transactions (INSERT,
UPDATE, DELETE). OLTP systems emphasize very fast query processing and maintaining data integrity in multi-
access environments. For OLTP systems, effectiveness is measured by the number of transactions per second. OLTP
databases contain detailed and current data. The schema used to store transactional databases is the entity model
(usually 3NF). Normalization is the norm for data modeling techniques in this system.