
DATA WAREHOUSING

 A data warehouse is a type of data management system that is designed to enable and support business
intelligence (BI) activities, especially analytics. Data warehouses are solely intended to perform queries and
analysis and often contain large amounts of historical data. The data within a data warehouse is usually
derived from a wide range of sources such as application log files and transaction applications.
 A data warehouse centralizes and consolidates large amounts of data from multiple sources. Its analytical
capabilities allow organizations to derive valuable business insights from their data to improve decision-
making. Over time, it builds a historical record that can be invaluable to data scientists and business analysts.
Because of these capabilities, a data warehouse can be considered an organization’s “single source of truth.”

BENEFITS OF A DATA WAREHOUSE


 Data warehouses offer the overarching and unique benefit of allowing organizations to analyze large amounts
of variant data and extract significant value from it, as well as to keep a historical record.

Four unique characteristics (described by computer scientist William Inmon, who is considered the father of
the data warehouse) allow data warehouses to deliver this overarching benefit. According to this definition,
data warehouses are
 Subject-oriented. They can analyze data about a particular subject or functional area (such as sales).
 Integrated. Data warehouses create consistency among different data types from disparate sources.
 Nonvolatile. Once data is in a data warehouse, it’s stable and doesn’t change.
 Time-variant. Data warehouse analysis looks at change over time.

A well-designed data warehouse will perform queries very quickly, deliver high data throughput, and provide enough
flexibility for end users to “slice and dice” or reduce the volume of data for closer examination to meet a variety of
demands—whether at a high level or at a very fine, detailed level. The data warehouse serves as the functional
foundation for middleware BI environments that provide end users with reports, dashboards, and other interfaces.

DATA WAREHOUSE ARCHITECTURE


 Simple. All data warehouses share a basic design in which metadata, summary data, and raw data are stored
within the central repository of the warehouse. The repository is fed by data sources on one end and accessed
by end users for analysis, reporting, and mining on the other end.
 Simple with a staging area. Operational data must be cleaned and processed before being put in the
warehouse. Although this can be done programmatically, many data warehouses add a staging area for data
before it enters the warehouse, to simplify data preparation.
 Hub and spoke. Adding data marts between the central repository and end users allows an organization to
customize its data warehouse to serve various lines of business. When the data is ready for use, it is moved to
the appropriate data mart.
 Sandboxes. Sandboxes are private, secure, safe areas that allow companies to quickly and informally explore
new datasets or ways of analyzing data without having to conform to or comply with the formal rules and
protocol of the data warehouse.
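The staging-area and hub-and-spoke designs above can be sketched in a few lines of code. This is a toy illustration, not a real implementation: the source rows, field names, and "East"/"West" regions are all invented, and plain Python lists stand in for the repository and marts.

```python
# Minimal sketch of the staging-area / data-mart flow described above.
# All source and field names here are hypothetical examples.

def extract(source_rows):
    """Pull raw rows from an operational source (simulated as a list)."""
    return list(source_rows)

def stage(rows):
    """Staging area: clean and standardize data before it enters the warehouse."""
    cleaned = []
    for r in rows:
        if r.get("amount") is None:                 # drop incomplete records
            continue
        cleaned.append({
            "region": r["region"].strip().title(),  # normalize casing
            "amount": float(r["amount"]),           # normalize type
        })
    return cleaned

def load_warehouse(warehouse, rows):
    """Append cleaned rows to the central repository."""
    warehouse.extend(rows)

def build_mart(warehouse, region):
    """Hub and spoke: a data mart serves one line of business."""
    return [r for r in warehouse if r["region"] == region]

warehouse = []
raw = [{"region": " east ", "amount": "100.5"},
       {"region": "West", "amount": None},          # rejected in staging
       {"region": "EAST", "amount": "50"}]
load_warehouse(warehouse, stage(extract(raw)))
east_mart = build_mart(warehouse, "East")
```

The point of the sketch is the ordering: cleaning happens in the staging step, so the repository only ever holds conformed rows, and marts are cheap filtered views of it.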

What is a Cloud Data Warehouse?


 A cloud data warehouse uses the cloud to ingest and store data from disparate data sources. The original
data warehouses were built with on-premises servers. These on-premises data warehouses continue to have
many advantages today. In many cases, they can offer better governance, security, data sovereignty, and
latency. However, on-premises data warehouses are not as elastic, and they require complex forecasting to
determine how to scale the data warehouse for future needs. Managing these data warehouses can also be
very complex. On the other hand, some of the advantages of cloud data warehouses include:
 Elastic, scale-out support for large or variable compute or storage requirements
 Ease of use
 Ease of management
 Cost savings

The best cloud data warehouses are fully managed and self-driving, ensuring that even beginners can create and use a
data warehouse with only a few clicks. An easy way to start your migration to a cloud data warehouse is to run it
on-premises, behind your data center firewall, which satisfies data sovereignty and security requirements.
In addition, most cloud data warehouses follow a pay-as-you-go model, which brings added cost savings to customers.

INTRODUCTION TO DATA WAREHOUSING

What is data warehousing?


 A data warehouse is a relational database designed for query and analysis rather than for transaction
processing.
 It contains historical and cumulative data derived from transaction data from single or multiple sources.
 A data warehouse is a single version of the truth for an organization, created to help with decision-making
and forecasting.

Where is it used?
 It is used for evaluating future strategy.
 The ultimate use of data warehouses is Mass Customization.
For example, it helped increase Capital One’s customers from 1 million to approximately 9 million in 8 years.

Elements of a typical Data Warehouse


• A Relational database to store and manage data.
• An extraction, loading, and transformation (ELT) solution for preparing the data for analysis
• Statistical analysis, reporting, and data mining capabilities
• Client analysis tools for visualizing and presenting data to business users
• Other, more sophisticated analytical applications that generate actionable information by applying data
science and artificial intelligence (AI) algorithms, or graph and spatial features that enable more kinds of
analysis of data at scale.
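The element list above mentions an ELT solution: unlike classic ETL, raw data is loaded first and transformed inside the database with SQL. A minimal sketch of that idea, using Python's built-in sqlite3 as a stand-in for the warehouse database (table and column names are invented):

```python
import sqlite3

# ELT sketch: load raw rows first, then transform *inside* the database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_sales (region TEXT, amount TEXT)")
# Load step: raw strings go in untouched.
conn.executemany("INSERT INTO raw_sales VALUES (?, ?)",
                 [("east", "100"), ("east", "50"), ("west", "70")])

# Transform step runs in-database: cast types, normalize casing, aggregate.
conn.execute("""
    CREATE TABLE sales_summary AS
    SELECT UPPER(region) AS region, SUM(CAST(amount AS REAL)) AS total
    FROM raw_sales GROUP BY UPPER(region)
""")
rows = conn.execute(
    "SELECT region, total FROM sales_summary ORDER BY region").fetchall()
```

Pushing the transformation into the database engine is what lets ELT exploit the warehouse's own query power instead of an external transformation server.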

Benefits of Data Warehouse


Data warehouses offer the overarching and unique benefit of allowing organizations to analyze large amounts of
variant data and extract significant value from it, as well as to keep a historical record.
1. Enables Historical Insight
2. Enhances Conformity and Data Quality
3. Boosts Efficiency
4. Increases the Power and Speed of Data Analytics
5. Drives Revenue Growth
6. Offers High Scalability
7. Interoperates with On-Premises and Cloud
8. Boosts Data Security
9. Delivers Higher Query Performance and Insight
10. Provides a Major Competitive Advantage

Father of Data Warehouse

BILL INMON
 The father of the data warehouse.
 Co-creator of the Corporate Information Factory.
 He has 35 years of experience in database technology management and data warehouse design.
 Bill has written about a variety of topics on the building, usage, and maintenance of the warehouse and the
Corporate Information Factory.
 He has written more than 650 articles (in Datamation, Computer World, and Byte Magazine).
4 UNIQUE CHARACTERISTICS OF DATA WAREHOUSE

1. Subject-oriented. They can analyze data about a particular subject or functional area (such as sales).
2. Integrated. Data warehouses create consistency among different data types from disparate sources.
3. Nonvolatile. Once data is in a data warehouse, it’s stable and doesn’t change.
4. Time-variant. Data warehouse analysis looks at change over time.

EVOLUTION OF DATA WAREHOUSE


(From Data Analytics to AI and Machine Learning)
 Data warehouses first appeared in the late 1980s.
 Purpose: Help data flow from operational systems into decision-support systems (DSSs)
 These early data warehouses required an enormous amount of redundancy
 As data warehouses became more efficient, they evolved from information stores that supported traditional BI
platforms into broad analytics infrastructures that support a wide variety of applications, such as operational
analytics and performance management.

Data warehouse iterations have progressed over time to deliver incremental additional value to the enterprise
with the enterprise data warehouse (EDW).

Step | Capability | Business Value
1 | Transactional reporting | Provides relational information to create snapshots of business performance
2 | Slice and dice, ad hoc query, BI tools | Expands capabilities for deeper insights and more robust analysis
3 | Predicting future performance (data mining) | Develops visualizations and forward-looking business intelligence
4 | Tactical analysis (spatial, statistics) | Offers “what-if” scenarios to inform practical decisions based on more comprehensive analysis
5 | Stores many months or years of data | Supports long-range historical analysis; earlier systems stored data for only weeks or months
 Supporting each of these five steps has required an increasing variety of datasets. The last three steps create
the imperative for an even broader range of data and analytics capabilities.
 The autonomous data warehouse is the latest step in this evolution, offering enterprises the ability to extract
even greater value from their data while lowering costs and improving data warehouse reliability and
performance.

DATA WAREHOUSE ARCHITECTURE


Data Warehouse Architecture
⮚ a method of defining the overall architecture of data communication, processing, and presentation that exists for
end-client computing within the enterprise.

Three common architectures


 Data Warehouse Architecture: Basic
 Data Warehouse Architecture: With Staging Area
 Data Warehouse Architecture: With Staging Area and Data Marts

Data Warehouse Architecture: Basic

Operational System. In data warehousing, an operational system refers to the system that processes the day-to-day
transactions of an organization.
Flat Files. A Flat file system is a system of files in which transactional data is stored, and every file in the system
must have a different name.
Meta Data. A set of data that defines and gives information about other data.
Metadata is used in a data warehouse for a variety of purposes, including:
- Summarizing necessary information about the data, which makes finding and working with particular instances of
data easier.
- Directing a query to the most appropriate data source.
Lightly and highly summarized data. This area of the data warehouse stores all the predefined lightly and highly
summarized (aggregated) data generated by the warehouse manager.
The goal of the summarized information is to speed up query performance. The summarized record is updated
continuously as new information is loaded into the warehouse.
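The continuously updated summary area can be illustrated with a running aggregate. This is a deliberately simplified model: a plain dict plays the role of the warehouse manager's summary table, and the products and amounts are invented.

```python
# Sketch of a lightly summarized area kept current as new facts load.
summary = {}   # product -> pre-aggregated total

def load_fact(summary, product, amount):
    """Update the pre-aggregated summary as each new record arrives."""
    summary[product] = summary.get(product, 0.0) + amount

for product, amount in [("tea", 5.0), ("coffee", 3.0), ("tea", 2.5)]:
    load_fact(summary, product, amount)

# A query against the summary avoids rescanning all the detail rows.
tea_total = summary["tea"]
```

This is exactly why summaries speed up queries: the aggregation cost is paid once at load time rather than on every query.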

End-User Access Tools. The principal purpose of a data warehouse is to provide information to business managers
for strategic decision-making. These users interact with the warehouse using end-client access tools.

Examples of end-user access tools include:


● Reporting and Query Tools
● Application Development Tools
● Executive Information Systems Tools
● Online Analytical Processing Tools
● Data Mining Tools

Data Warehouse Architecture: With Staging Area


We must clean and process operational data before putting it into the warehouse. This can be done
programmatically, although most data warehouses use a staging area (a place where data is processed before entering
the warehouse).
A staging area simplifies data cleansing and consolidation for operational data coming from multiple source
systems, especially for enterprise data warehouses where all relevant data of an enterprise is consolidated.

Architecture of a Data Warehouse w/ Staging Area

Data Warehouse Staging Area is a temporary location where a record from source systems is copied.

Data Warehouse Architecture: With Staging Area and Data Marts

We may want to customize our warehouse's architecture for multiple groups within our organization.
We can do this by adding data marts. A data mart is a segment of a data warehouse that provides information for
reporting and analysis on a section, unit, department, or operation of the company, e.g., sales, payroll, production, etc.

Properties of Data Warehouse Architectures


The following architecture properties are necessary for a data warehouse system:


1. Separation: Analytical and transactional processing should be kept apart as much as possible.


2. Scalability: Hardware and software architectures should be easy to upgrade as the data volume that has to be
managed and processed, and the number of user requirements that have to be met, progressively increase.
3. Extensibility: The architecture should be able to accommodate new operations and technologies without redesigning
the whole system.
4. Security: Monitoring access is necessary because of the strategic data stored in the data warehouse.
5. Administrability: Data Warehouse management should not be complicated.

Types of Data Warehouse Architectures SINGLE-TIER ARCHITECTURE

SINGLE-TIER ARCHITECTURE is not frequently used in practice. Its purpose is to minimize the amount of
data stored; to reach this goal, it removes data redundancies. The weakness of this architecture lies in its failure to
meet the requirement for separation between analytical and transactional processing. Analysis queries are submitted to
operational data after the middleware interprets them. In this way, queries affect transactional workloads.

TWO-TIER ARCHITECTURE
Four subsequent data flow stages:
• Source layer
• Data Staging 
• Data Warehouse layer 
• Analysis

Subsequent Data Flow Stages:


 Source layer: A data warehouse system uses heterogeneous sources of data. That data is initially stored in
corporate relational databases or legacy databases, or it may come from an information system outside the
corporate walls.
 Data Staging: The data stored in the sources should be extracted, cleansed to remove inconsistencies and fill
gaps, and integrated to merge heterogeneous sources into one standard schema. So-called Extraction,
Transformation, and Loading (ETL) tools can combine heterogeneous schemata and extract, transform, cleanse,
validate, filter, and load source data into a data warehouse.
 Data Warehouse layer: Information is stored in one logically centralized repository: the data
warehouse. The data warehouse can be accessed directly, but it can also be used as a source for creating data
marts, which partially replicate data warehouse contents and are designed for specific enterprise departments.
Metadata repositories store information on sources, access procedures, data staging, users, data mart schemata,
and so on.
 Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue reports, dynamically
analyze information, and simulate hypothetical business scenarios. It should feature aggregate information
navigators, complex query optimizers, and customer-friendly GUIs.

THREE-TIER ARCHITECTURE

The three-tier architecture consists of the source layer (containing multiple source systems), the reconciled layer, and
the data warehouse layer (containing both data warehouses and data marts). The reconciled layer sits between the
source data and the data warehouse.

The main advantage of the reconciled layer is that it creates a standard reference data model for a whole enterprise.
At the same time, it separates the problems of source data extraction and integration from those of data warehouse
population. In some cases, the reconciled layer is also used directly to better accomplish some operational tasks, such
as producing daily reports that cannot be satisfactorily prepared using the corporate applications, or generating data
flows to feed external processes periodically in order to benefit from cleaning and integration.

THREE-TIER DATA WAREHOUSE ARCHITECTURE

Data Warehouses usually have a three-level (tier) architecture that includes:


 Bottom Tier (Data Warehouse Server)
 Middle Tier (OLAP Server)
 Top Tier (Front end Tools)

BOTTOM-TIER
A bottom-tier that consists of the Data Warehouse server, which is almost always an RDBMS. It may include several
specialized data marts and a metadata repository.
Three-Tier Architecture
Data from operational databases and external sources (such as user profile data provided by external consultants) is
extracted using application program interfaces called gateways. A gateway is provided by the underlying DBMS and
allows client programs to generate SQL code to be executed at a server.
Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding,
Database), by Microsoft, and JDBC (Java Database Connectivity).
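The gateway idea, client code generating SQL that a driver ships to the server, can be sketched with Python's DB-API. Here sqlite3 stands in for an ODBC/JDBC-connected warehouse server, and the profiles table is an invented example:

```python
import sqlite3

def run_query(conn, sql, params=()):
    """Gateway role: send client-generated SQL through the driver
    and fetch the results executed at the server."""
    return conn.execute(sql, params).fetchall()

# sqlite3 stands in for a DBMS reached via an ODBC/JDBC-style gateway.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profiles (user_id INTEGER, segment TEXT)")
conn.executemany("INSERT INTO profiles VALUES (?, ?)",
                 [(1, "retail"), (2, "corporate")])

result = run_query(conn, "SELECT segment FROM profiles WHERE user_id = ?", (2,))
```

The client never touches storage directly; it only generates SQL and parameters, which is precisely the separation the gateway provides.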

Three-Tier Architecture for a data warehouse

MIDDLE-TIER
A middle-tier which consists of an OLAP server for fast querying of the data warehouse.
The OLAP server is implemented using either:

(1) A Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps functions on
multidimensional data to standard relational operations.
(2) A Multidimensional OLAP (MOLAP) model, i.e., a special-purpose server that directly implements
multidimensional data and operations.
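The ROLAP mapping can be made concrete: a multidimensional request (a list of dimensions plus a measure) is translated into an ordinary relational GROUP BY. The fact table and its columns below are hypothetical, with sqlite3 standing in for the extended relational DBMS:

```python
import sqlite3

def rollup(conn, dimensions, measure, fact_table):
    """ROLAP sketch: map a multidimensional query onto standard SQL."""
    dims = ", ".join(dimensions)
    sql = (f"SELECT {dims}, SUM({measure}) FROM {fact_table} "
           f"GROUP BY {dims} ORDER BY {dims}")
    return conn.execute(sql).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("east", 2023, 10.0), ("east", 2023, 5.0),
                  ("west", 2024, 7.0)])

# Roll up the 'amount' measure over the region and year dimensions.
cube = rollup(conn, ["region", "year"], "amount", "sales")
```

A MOLAP server would instead store the cube itself; ROLAP's appeal is that it reuses the relational engine unchanged.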

TOP-TIER
A top-tier that contains front-end tools for displaying results provided by OLAP, as well as additional tools for data
mining of the OLAP-generated data.

The metadata repository stores information that defines DW objects. It includes the following parameters and
information for the middle and the top-tier applications:
• A description of the DW structure, including the warehouse schema, dimensions, hierarchies, data mart
locations and contents, etc.
• Operational metadata, which usually describes the currency level of the stored data, i.e., active, archived or
purged, and warehouse monitoring information, i.e., usage statistics, error reports, audit, etc.
• System performance data, which includes indices, used to improve data access and retrieval performance.
• Information about the mapping from operational databases, which provides source RDBMSs and their
contents, cleaning and transformation rules, etc.
• Summarization algorithms, predefined queries, and reports; business data, which includes business terms and
definitions, ownership information, etc.
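A metadata repository covering the categories listed above can be sketched as a small record type. The specific fields and the example entry are illustrative, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class TableMetadata:
    schema: str               # DW structure: owning schema
    currency: str             # operational metadata: active/archived/purged
    source_system: str        # mapping back to the operational database
    transformation_rule: str  # cleaning/transformation applied on load
    indices: list = field(default_factory=list)  # system performance data

# Toy repository: one entry per warehouse object.
repository = {
    "sales_fact": TableMetadata(
        schema="warehouse",
        currency="active",
        source_system="orders_oltp",
        transformation_rule="amounts converted to USD",
        indices=["idx_sales_date"],
    )
}

# Middle- and top-tier tools would query the repository, e.g. for currency.
active_tables = [name for name, m in repository.items()
                 if m.currency == "active"]
```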

PRINCIPLES OF DATA WAREHOUSING


Load Performance
Data warehouses require the incremental loading of new data on a periodic basis within narrow time windows;
performance of the load process should be measured in hundreds of millions of rows and gigabytes per hour, and it
must not artificially constrain the volume of data the business needs.
Load Processing
Many steps must be taken to load new or updated data into the data warehouse, including data conversion, filtering,
reformatting, indexing, and metadata updates.
Data Quality Management
Fact-based management demands the highest data quality. The warehouse ensures local consistency, global
consistency, and referential integrity despite "dirty" sources and massive database size.
Query Performance
Fact-based management must not be slowed by the performance of the data warehouse RDBMS; large, complex
queries must complete in seconds, not days.
Terabyte Scalability
Data warehouse sizes are growing at astonishing rates. Today they range from a few gigabytes to hundreds of
gigabytes, and terabyte-sized data warehouses are increasingly common.

DATA WAREHOUSING PROCESS

A data warehouse, also known as an enterprise data warehouse (EDW), is a system that collects data from various
sources and stores it in a single, centralized, consistent location to facilitate data analysis, data mining, artificial
intelligence (AI), and machine learning. As the business evolves, so do its requirements, and a data warehouse must be
developed to keep up with these changes. As a result, a data warehouse system must be versatile.

Ideally there should be a delivery process to deliver a data warehouse. However, data warehouse projects normally
suffer from various issues that make it difficult to complete tasks and deliverables in the strict and ordered fashion
demanded by the waterfall method. Most of the time, the requirements are not completely understood. The
architectures, designs, and build components can be completed only after gathering and studying all the requirements.

DELIVERY METHOD
The delivery method is a variant of the joint application development approach adopted for the delivery of a data
warehouse. We have staged the data warehouse delivery process to minimize risks. The approach that we will discuss
here does not reduce the overall delivery time-scales but ensures the business benefits are delivered incrementally
through the development process.
Note − The delivery process is broken into phases to reduce the project and delivery risk.

IT Strategy
Data warehouses are strategic investments that require a business process to generate benefits. An IT strategy is
required to procure and retain funding for the project.

Business Case
The objective of the business case is to estimate the business benefits that should be derived from using a data warehouse.
These benefits may not be quantifiable but the projected benefits need to be clearly stated. If a data warehouse does
not have a clear business case, then the business tends to suffer from credibility problems at some stage during the
delivery process. Therefore, in data warehouse projects, we need to understand the business case for investment.

Education and Prototyping


Organizations experiment with the concept of data analysis and educate themselves on the value of having a data
warehouse before settling for a solution. This is addressed by prototyping. It helps in understanding the feasibility and
benefits of a data warehouse. A prototyping activity on a small scale can promote the educational process as long as −
• The prototype addresses a defined technical objective.
• The prototype can be thrown away after the feasibility concept has been shown.
• The activity addresses a small subset of eventual data content of the data warehouse.
• The activity timescale is non-critical.

The following points are to be kept in mind to produce an early release and deliver business benefits.
• Identify the architecture that is capable of evolving.
• Focus on business requirements and technical blueprint phases.
• Limit the scope of the first build phase to the minimum that delivers business benefits.
• Understand the short-term and medium-term requirements of the data warehouse.

Business Requirements

To provide quality deliverables, we should make sure the overall requirements are understood. If we understand the
business requirements for both short-term and medium-term, then we can design a solution to fulfil short-term
requirements. The short-term solution can then be grown to a full solution.
The following aspects are determined in this stage −
• The business rule to be applied on data.
• The logical model for information within the data warehouse.
• The query profiles for the immediate requirement.
• The source systems that provide this data.

Technical Blueprint
This phase needs to deliver an overall architecture satisfying the long-term requirements. This phase also delivers the
components that must be implemented in the short term to derive any business benefit. The blueprint needs to identify
the following.
• The overall system architecture.
• The data retention policy.
• The backup and recovery strategy.
• The server and data mart architecture.
• The capacity plan for hardware and infrastructure.
• The components of database design.

Building the Version. In this stage, the first production deliverable is produced. This production deliverable is the
smallest component of a data warehouse. This smallest component adds business benefit.

History Load. This is the phase where the remainder of the required history is loaded into the data warehouse. In this
phase, we do not add new entities, but additional physical tables would probably be created to store increased data
volumes.

Ad hoc Query. In this phase, we configure an ad hoc query tool that is used to operate a data warehouse. These tools
can generate the database query.
Note − It is recommended not to use these access tools when the database is being substantially modified.

Automation
In this phase, operational management processes are fully automated. These would include −
• Transforming the data into a form suitable for analysis.
• Monitoring query profiles and determining appropriate aggregations to maintain system performance.
• Extracting and loading data from different source systems.
• Generating aggregations from predefined definitions within the data warehouse.
• Backing up, restoring, and archiving the data.

Extending Scope. In this phase, the data warehouse is extended to address a new set of business requirements. The
scope can be extended in two ways −
• By loading additional data into the data warehouse.
• By introducing new data marts using the existing information.
Note − This phase should be performed separately, since it involves substantial efforts and complexity.

Requirements Evolution
 From the perspective of the delivery process, the requirements are always changeable. They are not static. The
delivery process must support this and allow these changes to be reflected within the system.

This issue is addressed by designing the data warehouse around the use of data within business processes, as
opposed to the data requirements of existing queries.

The architecture is designed to change and grow to match the business needs. The process operates as a
pseudo-application development process, where new requirements are continually fed into the
development activities and partial deliverables are produced. These partial deliverables are fed back to the
users and then reworked, ensuring that the overall system is continually updated to meet the business needs.

SYSTEM PROCESSES
 We have a fixed number of operations to be applied on operational databases, and we have well-defined
techniques for them, such as using normalized data and keeping tables small. These techniques are suitable for
delivering a solution. But in the case of decision-support systems, we do not know what queries and operations
will need to be executed in the future. Therefore, the techniques applied to operational databases are not suitable
for data warehouses.

Process Flow in Data Warehouse

There are four major processes that contribute to a data warehouse −


• Extract and load the data.
• Cleaning and transforming the data.
• Backup and archive the data.
• Managing queries and directing them to the appropriate data sources.

Extract and Load Process. Data extraction takes data from the source systems. Data load takes the extracted data and
loads it into the data warehouse.
Note − Before loading the data into the data warehouse, the information extracted from the external sources must be
reconstructed.
Controlling the Process. Controlling the process involves determining when to start data extraction and running
consistency checks on the data. The controlling process ensures that the tools, the logic modules, and the programs are
executed in the correct sequence and at the correct time.

When to Initiate Extract. Data needs to be in a consistent state when it is extracted, i.e., the data warehouse should
represent a single, consistent version of the information to the user.

Loading the Data. After extracting the data, it is loaded into a temporary data store where it is cleaned up and made
consistent.
Note − Consistency checks are executed only when all the data sources have been loaded into the temporary data
store.

Clean and Transform Process. Once the data is extracted and loaded into the temporary data store, it is time to
perform Cleaning and Transforming. Here is the list of steps involved in Cleaning and Transforming −
• Clean and transform the loaded data into a structure
• Partition the data
• Aggregation
Clean and Transform the Loaded Data into a Structure. Cleaning and transforming the loaded data helps speed up
the queries. This is done by making the data consistent −
• within itself.
• with other data within the same data source.
• with the data in other source systems.
• with the existing data present in the warehouse.

Transforming involves converting the source data into a structure. Structuring the data increases the query
performance and decreases the operational cost. The data contained in a data warehouse must be transformed to
support performance requirements and control the ongoing operational costs.
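The conforming step described above can be sketched as a small transform that makes rows from two sources consistent with one warehouse structure. The source layouts, country codes, and customer IDs are invented for illustration:

```python
# Sketch of clean-and-transform: rows from two hypothetical sources are
# conformed to one structure (consistent codes and types).
COUNTRY_CODES = {"usa": "US", "united states": "US", "uk": "GB"}

def conform(row):
    """Make a source row consistent with the warehouse structure."""
    return {
        "customer_id": int(row["customer_id"]),   # consistent type
        "country": COUNTRY_CODES.get(             # consistent code
            row["country"].strip().lower(),
            row["country"].strip().upper()),
    }

source_a = [{"customer_id": "1", "country": "USA"}]
source_b = [{"customer_id": "2", "country": "united states"}]
conformed = [conform(r) for r in source_a + source_b]
```

After conforming, the two sources agree on a single representation ("US"), so queries in the warehouse can group and join them without per-source special cases.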

Partition the Data. Partitioning optimizes hardware performance and simplifies the management of the data warehouse.
Here we partition each fact table into multiple separate partitions.
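Fact-table partitioning can be sketched as routing each row to a per-month slice, so queries and maintenance touch only the partitions they need. The date-keyed rows are invented examples:

```python
from collections import defaultdict

def partition_by_month(fact_rows):
    """Sketch of horizontal partitioning: route each fact row to a
    per-month partition keyed like '2024-01'."""
    partitions = defaultdict(list)
    for row in fact_rows:
        partitions[row["date"][:7]].append(row)  # YYYY-MM prefix as key
    return partitions

facts = [{"date": "2024-01-15", "amount": 10.0},
         {"date": "2024-01-20", "amount": 5.0},
         {"date": "2024-02-01", "amount": 7.0}]
parts = partition_by_month(facts)
```

A query for January then scans only `parts["2024-01"]`, and archiving an old month means dropping a single partition rather than deleting rows.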
Aggregation. Aggregation is required to speed up common queries. Aggregation relies on the fact that most common
queries will analyze a subset or an aggregation of the detailed data.
Backup and Archive the Data. In order to recover the data in the event of data loss, software failure, or hardware
failure, it is necessary to keep regular backups. Archiving involves removing old data from the system in a format
that allows it to be quickly restored whenever required.

Query Management Process


This process performs the following functions −
• manages the queries.
• helps speed up the execution time of queries.
• directs the queries to their most effective data sources.
• ensures that all the system sources are used in the most effective way.
• monitors actual query profiles.

The information generated in this process is used by the warehouse management process to determine which
aggregations to generate. This process does not generally operate during the regular load of information into the data
warehouse.

DATA WAREHOUSING CONCEPTS

PROPERTIES OF A DATA WAREHOUSE


“A Data Warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of
management’s decision-making process.” – Bill Inmon, Father of Data Warehousing

PROPERTIES OF A DATA WAREHOUSE


 Subject-oriented. Data is categorized and stored by business subject rather than by application
 Integrated. Data on a given subject is collected from disparate sources and stored in a single place.
 Time-variant. Data is stored as a series of snapshots, each representing a period of time
 Nonvolatile. Typically, data in the data warehouse is not updated or deleted.

USING DATA WAREHOUSE INFORMATION


 There are decision support technologies that help utilize the data available in a data warehouse. These
technologies help executives use the warehouse quickly and effectively. They can gather data, analyze it,
and make decisions based on the information present in the warehouse. The information gathered in a
warehouse can be used in any of the following domains −
• Tuning Production Strategies − The product strategies can be well tuned by repositioning the products and
managing the product portfolios by comparing the sales quarterly or yearly.
• Customer Analysis − Customer analysis is done by analyzing the customer's buying preferences, buying
time, budget cycles, etc.
• Operations Analysis − Data warehousing also helps in customer relationship management, and making
environmental corrections. The information also allows us to analyze business operations.

INTEGRATING HETEROGENEOUS DATABASES


To integrate heterogeneous databases, we have two approaches −
• Query-driven Approach
• Update-driven Approach

QUERY-DRIVEN APPROACH
This is the traditional approach to integrate heterogeneous databases. This approach was used to build wrappers and
integrators on top of multiple heterogeneous databases. These integrators are also known as mediators.

PROCESS OF QUERY-DRIVEN APPROACH


• When a query is issued on the client side, a metadata dictionary translates the query into an appropriate form for
the individual heterogeneous sites involved.
• Now these queries are mapped and sent to the local query processor.
• The results from heterogeneous sites are integrated into a global answer set.
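The mediator steps above can be sketched as follows. The two site "dialects" and their row layouts are invented; the point is that one logical query is translated per site and the per-site answers are merged into a global answer set:

```python
# Toy mediator for the query-driven approach.
def translate(site, min_amount):
    """Metadata-dictionary role: adapt the query to each site's schema."""
    if site["dialect"] == "orders":
        return [r for r in site["rows"] if r["total"] >= min_amount]
    return [r for r in site["rows"] if r["amt"] >= min_amount]

def mediate(sites, min_amount):
    """Integrate per-site answers into one global answer set."""
    answer = []
    for site in sites:
        for r in translate(site, min_amount):
            answer.append(r.get("total", r.get("amt")))
    return sorted(answer)

sites = [{"dialect": "orders", "rows": [{"total": 30}, {"total": 5}]},
         {"dialect": "legacy", "rows": [{"amt": 12}]}]
global_answer = mediate(sites, 10)
```

Even in this toy form the cost is visible: every query pays for per-site translation and result integration at query time, which is exactly the inefficiency the update-driven approach (below in the original text) avoids by integrating in advance.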

DISADVANTAGES
• Query-driven approach needs complex integration and filtering processes.
• This approach is very inefficient.
• It is very expensive for frequent queries.
• This approach is also very expensive for queries that require aggregations.

UPDATE-DRIVEN APPROACH

This is an alternative to the traditional approach. Today's data warehouse systems follow the update-driven approach
rather than the traditional approach discussed earlier. In the update-driven approach, the information from multiple
heterogeneous sources is integrated in advance and stored in a warehouse. This information is available for direct
querying and analysis.
ADVANTAGES
This approach has the following advantages −
• This approach provides high performance.
• The data is copied, processed, integrated, annotated, summarized, and restructured in a semantic data store in
advance.
• Query processing does not require an interface to process data at local sources.
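The contrast with the query-driven approach can be sketched as follows. This is a minimal illustration under invented data: the two source lists and the `refresh_warehouse` function are hypothetical, but they show the key idea that integration happens ahead of time, so queries read the warehouse directly.

```python
# Source systems (contents invented for the example).
SOURCE_SALES = [("2023-Q1", 100), ("2023-Q2", 150)]
SOURCE_RETURNS = [("2023-Q1", 10)]

def refresh_warehouse():
    """Copy, integrate, and summarize source data into the warehouse store
    in advance, before any query is issued."""
    warehouse = {}
    for quarter, amount in SOURCE_SALES:
        row = warehouse.setdefault(quarter, {"sales": 0, "returns": 0})
        row["sales"] += amount
    for quarter, amount in SOURCE_RETURNS:
        row = warehouse.setdefault(quarter, {"sales": 0, "returns": 0})
        row["returns"] += amount
    return warehouse

WAREHOUSE = refresh_warehouse()

# Direct querying: no interface to the local sources is needed at query time.
print(WAREHOUSE["2023-Q1"])
```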

Functions of Data Warehouse Tools and Utilities


The following are the functions of data warehouse tools and utilities −
• Data Extraction − Involves gathering data from multiple heterogeneous sources.
• Data Cleaning − Involves finding and correcting the errors in data.
• Data Transformation − Involves converting the data from legacy format to warehouse format.
• Data Loading − Involves sorting, summarizing, consolidating, checking integrity, and building indices and
partitions.
• Refreshing − Involves updating the warehouse with changes from the data sources.
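The extraction, cleaning, transformation, and loading functions listed above can be sketched as a small pipeline. The record layouts, field names, and the duplicate/whitespace defects are all invented for the example; real ETL tools do the same kinds of work at much larger scale.

```python
def extract():
    """Data extraction: gather records from multiple heterogeneous sources."""
    return [
        {"id": "1", "amount": "100.0"},    # from a CSV-like source
        {"id": "2", "amount": " 50.5 "},   # from a legacy system
        {"id": "2", "amount": " 50.5 "},   # duplicate record to be cleaned out
    ]

def clean(rows):
    """Data cleaning: trim stray whitespace and drop duplicate records."""
    seen, out = set(), []
    for row in rows:
        key = (row["id"], row["amount"].strip())
        if key not in seen:
            seen.add(key)
            out.append({"id": row["id"], "amount": row["amount"].strip()})
    return out

def transform(rows):
    """Data transformation: convert legacy string fields to warehouse types."""
    return [{"id": int(r["id"]), "amount": float(r["amount"])} for r in rows]

def load(rows):
    """Data loading: sort, summarize, and index the transformed records."""
    rows = sorted(rows, key=lambda r: r["id"])
    index = {r["id"]: r for r in rows}        # simple primary-key index
    total = sum(r["amount"] for r in rows)    # precomputed summary
    return {"rows": rows, "index": index, "total": total}

warehouse = load(transform(clean(extract())))
print(warehouse["total"])   # 150.5
```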

RELATED SYSTEMS

DATA MART
 A data mart is a simple form of a data warehouse that is focused on a single subject or functional area, so it
draws data from a limited number of sources, such as sales, finance, or marketing. Data marts are often built
and controlled by a single department within an organization. The sources could be internal operational
systems, a central data warehouse, or external data. Denormalization is the norm for data modeling in a data
mart. Because data marts generally cover only a subset of the data contained in a data warehouse, they are
often easier and faster to implement. Types of data marts include dependent, independent, and hybrid data
marts.
Online analytical processing (OLAP) is characterized by a relatively low volume of transactions. Queries are often
very complex and involve aggregations. For OLAP systems, response time is an effective measure. OLAP
applications are widely used in data mining. OLAP databases store aggregated, historical data in multi-
dimensional schemas (usually star schemas). OLAP systems typically have a data latency of a few hours, as opposed
to data marts, where latency is expected to be closer to one day. The OLAP approach is used to analyze
multidimensional data from multiple sources and perspectives. The three basic operations in OLAP are Roll-up
(Consolidation), Drill-down, and Slicing & Dicing.
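The three basic operations can be sketched on a toy cube of (year, quarter, region, sales) tuples. The data and function names are illustrative only; a real OLAP engine would evaluate these over a multi-dimensional schema.

```python
from collections import defaultdict

# Toy multi-dimensional data: (year, quarter, region, sales).
CUBE = [
    ("2023", "Q1", "East", 100), ("2023", "Q1", "West", 80),
    ("2023", "Q2", "East", 120), ("2024", "Q1", "East", 90),
]

def roll_up(rows):
    """Roll-up (consolidation): aggregate quarters and regions up to years."""
    totals = defaultdict(int)
    for year, _quarter, _region, sales in rows:
        totals[year] += sales
    return dict(totals)

def drill_down(rows, year):
    """Drill-down: break one year's total back into its quarters."""
    totals = defaultdict(int)
    for y, quarter, _region, sales in rows:
        if y == year:
            totals[quarter] += sales
    return dict(totals)

def slice_by_region(rows, region):
    """Slicing: fix one dimension (region) and keep the rest of the cube."""
    return [r for r in rows if r[2] == region]

print(roll_up(CUBE))              # {'2023': 300, '2024': 90}
print(drill_down(CUBE, "2023"))   # {'Q1': 180, 'Q2': 120}
```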

Online transaction processing (OLTP) is characterized by a large number of short online transactions (INSERT,
UPDATE, DELETE). OLTP systems emphasize very fast query processing and maintaining data integrity in multi-
access environments. For OLTP systems, effectiveness is measured by the number of transactions per second. OLTP
databases contain detailed, current data. The schema used to store transactional databases is the entity model
(usually 3NF), and normalization is the norm for data modeling in these systems.
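A minimal OLTP-style sketch using Python's built-in sqlite3 module shows the pattern of short transactions that maintain integrity: either both updates commit, or both roll back. The table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO account VALUES (1, 100.0), (2, 50.0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """One short OLTP transaction: two UPDATEs that succeed or fail together."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE account SET balance = balance - ? WHERE id = ?",
            (amount, src))
        conn.execute(
            "UPDATE account SET balance = balance + ? WHERE id = ?",
            (amount, dst))

transfer(conn, 1, 2, 25.0)
print(conn.execute("SELECT balance FROM account ORDER BY id").fetchall())
# [(75.0,), (75.0,)]
```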

ETL-based data warehousing


 The typical extract, transform, load (ETL)-based data warehouse uses staging, data integration, and access
layers to house its key functions. The staging layer or staging database stores raw data extracted from each of
the disparate source data systems. The integration layer integrates the disparate data sets by transforming the
data from the staging layer often storing this transformed data in an operational data store (ODS) database.
The integrated data are then moved to yet another database, often called the data warehouse database, where
the data is arranged into hierarchical groups, often called dimensions, and into facts and aggregate facts. The
combination of facts and dimensions is sometimes called a star schema. The access layer helps users retrieve
data.
 The main source of the data is cleansed, transformed, catalogued, and made available for use by managers and
other business professionals for data mining, online analytical processing, market research and decision
support. However, the means to retrieve and analyze data, to extract, transform, and load data, and to manage
the data dictionary are also considered essential components of a data warehousing system. Many references
to data warehousing use this broader context. Thus, an expanded definition for data warehousing
includes business intelligence tools, tools to extract, transform, and load data into the repository, and tools to
manage and retrieve metadata.
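The star schema described above, with facts referencing dimensions, can be sketched with sqlite3. The table and column names (`fact_sales`, `dim_date`, `dim_product`) are invented for the example, but the shape is the standard one: a central fact table of measures keyed into surrounding dimension tables.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Dimension tables: descriptive attributes to group and filter by.
    CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INT, quarter TEXT);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    -- Fact table: foreign keys into each dimension plus additive measures.
    CREATE TABLE fact_sales (
        date_id    INT REFERENCES dim_date(date_id),
        product_id INT REFERENCES dim_product(product_id),
        amount     REAL);
    INSERT INTO dim_date VALUES (1, 2023, 'Q1'), (2, 2023, 'Q2');
    INSERT INTO dim_product VALUES (10, 'Widget');
    INSERT INTO fact_sales VALUES (1, 10, 100.0), (2, 10, 150.0);
""")

# A typical access-layer query: aggregate facts grouped by a dimension attribute.
rows = db.execute("""
    SELECT d.year, SUM(f.amount)
    FROM fact_sales f JOIN dim_date d ON f.date_id = d.date_id
    GROUP BY d.year
""").fetchall()
print(rows)   # [(2023, 250.0)]
```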

DATA WAREHOUSING DESIGN AND IMPLEMENTATION
