Data Warehouse Tutorial
tutorialspoint.com
ABOUT THE TUTORIAL

Audience

This reference has been prepared for computer science graduates to help them understand basic to advanced concepts related to data warehousing.

Prerequisites

Before proceeding with this tutorial, you should have an understanding of basic database concepts such as schema, the ER model, and Structured Query Language (SQL).
Copyright & Disclaimer Notice

This tutorial may contain inaccuracies or errors, and tutorialspoint provides no guarantee regarding the accuracy of the site or its contents, including this tutorial. If you discover that the tutorialspoint.com site or this tutorial contains errors, please contact us at webmaster@tutorialspoint.com.
Table of Contents

Data Warehouse Tutorial
  Audience
  Prerequisites
  Copyright & Disclaimer Notice
Overview
  Understanding Data Warehouse
  Definition
  Why a Data Warehouse Is Separated from Operational Databases
  Data Warehouse Features
  Data Warehouse Applications
  Data Warehouse Types
Data Warehouse Concepts
  What is Data Warehousing?
  Using Data Warehouse Information
  Integrating Heterogeneous Databases
  Query-Driven Approach
    Process of Query-Driven Approach
    Disadvantages
  Update-Driven Approach
    Advantages
  Data Warehouse Tools and Utilities Functions
Terminologies
  Data Warehouse
  Metadata Repository
  Data Cube
  Illustration of Data Cube
  Data Mart
  Points to Remember about Data Marts
  Virtual Warehouse
Delivery Process
  Introduction
  Delivery Method
  IT Strategy
  Business Case
  Education and Prototyping
  Business Requirements
  Technical Blueprint
  Building the Version
  History Load
  Ad hoc Query
  Automation
  Extending Scope
  Requirements Evolution
System Processes
  Process Flow in Data Warehouse
  Extract and Load Process
    Controlling the Process
    When to Initiate Extract
    Loading the Data
  Clean and Transform Process
    Clean and Transform the Loaded Data into a Structure
    Partition the Data
    Aggregation
  Backup and Archive the Data
  Query Management Process
Architecture
  Business Analysis Framework
  Three-Tier Data Warehouse Architecture
  Data Warehouse Models
    Virtual Warehouse
    Data Mart
    Enterprise Warehouse
  Load Manager
    Load Manager Architecture
    Extract Data from Source
    Fast Load
    Simple Transformations
  Warehouse Manager
    Warehouse Manager Architecture
    Operations Performed by Warehouse Manager
  Query Manager
    Query Manager Architecture
  Detailed Information
  Summary Information
OLAP
  Introduction
  Types of OLAP Servers
    Relational OLAP (ROLAP)
    Multidimensional OLAP (MOLAP)
    Hybrid OLAP (HOLAP)
    Specialized SQL Servers
  OLAP Operations
    Roll-up
    Drill-down
    Slice
    Dice
    Pivot
  OLAP vs OLTP
Relational OLAP
  Introduction
  Points to Remember
  Relational OLAP Architecture
  Advantages
  Disadvantages
Multidimensional OLAP
  Introduction
  Points to Remember
  MOLAP Architecture
  Advantages
  Disadvantages
  MOLAP vs ROLAP
Schema
  Introduction
  Star Schema
  Snowflake Schema
  Fact Constellation Schema
  Schema Definition
    Syntax for Cube Definition
    Syntax for Dimension Definition
  Star Schema Definition
  Snowflake Schema Definition
  Fact Constellation Schema Definition
Partitioning Strategy
  Introduction
  Why to Partition
    For Easy Management
    To Assist Backup/Recovery
    To Enhance Performance
  Horizontal Partitioning
    Partitioning by Time into Equal Segments
    Partitioning by Time into Different-Sized Segments
    Partition on a Different Dimension
    Partition by Size of Table
    Partitioning Dimensions
    Round-Robin Partitions
  Vertical Partitioning
    Normalization
    Row Splitting
  Identify Key to Partition
Metadata Concepts
  What is Metadata
  Categories of Metadata
  Role of Metadata
  Metadata Repository
  Challenges for Metadata Management
Data Marting
  Why to Create a Data Mart
  Steps to Determine that a Data Mart Appears to Fit the Bill
    Identify the Functional Splits
    Identify User Access Tool Requirements
    Identify Access Control Issues
  Designing Data Marts
  Cost of Data Marting
    Hardware and Software Cost
    Network Access
    Time Window Constraints
System Managers
  Introduction
  System Configuration Manager
  System Scheduling Manager
  System Event Manager
    Events
  System and Database Manager
  System Backup Recovery Manager
Process Managers
  Data Warehouse Load Manager
    Load Manager Architecture
    Extract Data from Source
    Fast Load
    Simple Transformations
  Warehouse Manager
    Warehouse Manager Architecture
    Operations Performed by Warehouse Manager
  Query Manager
    Query Manager Architecture
    Operations Performed by Query Manager
Security
  Introduction
  Requirements
  Factors to Consider for Security Requirements
    User Access
  Data Classification
  User Classification
  Classification on Basis of Role
    Audit Requirements
    Network Requirements
    Data Movement
  Documentation
  Impact of Security on Design
    Application Development
    Database Design
    Testing
Backup
  Introduction
  Backup Terminologies
  Hardware Backup
    Tape Technology
    Tape Media
    Standalone Tape Drives
    Tape Stackers
    Tape Silos
  Other Technologies
    Disk Backups
    Disk-to-Disk Backups
    Mirror Breaking
    Optical Jukeboxes
  Software Backups
    Criteria for Choosing Software Packages
Tuning
  Introduction
  Difficulties in Data Warehouse Tuning
  Performance Assessment
  Data Load Tuning
  Integrity Checks
  Tuning Queries
    Fixed Queries
    Ad hoc Queries
Testing
  Introduction
    Unit Testing
    Integration Testing
    System Testing
  Test Schedule
    Difficulties in Scheduling the Testing
  Testing the Backup Recovery
  Testing the Operational Environment
  Testing the Database
  Testing the Application
  Logistics of the Test
Future Aspects
Interview Questions
What is Next?
About tutorialspoint.com
CHAPTER 1
Overview

This chapter gives a basic idea of what a data warehouse is and why it is needed.
The term "Data Warehouse" was first coined by Bill Inmon in 1990. According to Inmon, a data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data. This data helps analysts in an organization support the decision-making process.
An operational database undergoes frequent changes on a daily basis because of the transactions that take place. Suppose a business executive later wants to analyze previous feedback on some data, such as product, supplier, or consumer data. In that case the executive will have no data available to analyze, because the earlier data has been updated by those transactions.

A data warehouse helps executives organize, understand, and use their data to make strategic decisions.
TUTORIALSPOINT
Simply Easy Learning Page 1
Data warehouse systems help in integrating a diversity of application systems.
Definition

A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data that supports management's decision-making process.
An operational database is constructed for well-known tasks and workloads, such as searching for particular records and indexing, whereas data warehouse queries are often complex and present a general form of data. An operational database maintains current data; a data warehouse, on the other hand, maintains historical data.
Non-Volatile - Non-volatile means that previous data is not removed when new data is added to it. The data warehouse is kept separate from the operational database; therefore frequent changes in the operational database are not reflected in the data warehouse.
Data warehouses are widely used in the following sectors:

- Financial services
- Banking services
- Consumer goods
- Retail sectors
- Controlled manufacturing
Data Mining - Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, and performing classification and prediction. The mining results can be presented using visualization tools.
SN | Data Warehouse (OLAP)               | Operational Database (OLTP)
3  | It is used to analyze the business. | It is used to run the business.
CHAPTER 2
Data Warehouse Concepts

This chapter describes various concepts of data warehousing.
Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed by integrating data from multiple heterogeneous sources. It supports analytical reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data consolidation.
Query-Driven Approach

In this approach, the queries are mapped and sent to the local query processors. The results from the heterogeneous sites are integrated into a global answer set.
DISADVANTAGES

- The query-driven approach needs complex integration and filtering processes.
- This approach is also very expensive for queries that require aggregations.
Update-Driven Approach

ADVANTAGES

This approach has the following advantages:

- Query processing does not require an interface to process data at the local sources.
Data Warehouse Tools and Utilities Functions

The following are the functions of data warehouse tools and utilities:

- Data Extraction - involves gathering data from multiple heterogeneous sources.
- Data Cleaning - involves finding and correcting errors in the data.
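The extraction and cleaning functions described above can be sketched in a few lines of Python. This is a minimal illustration only: the source records, field names, and cleaning rules below are hypothetical, and real tools handle many more formats and error types.

```python
# A minimal sketch of data extraction and data cleaning.
# Source records and field names are hypothetical examples.

def extract(sources):
    """Gather records from multiple heterogeneous sources into one list."""
    combined = []
    for source in sources:
        combined.extend(source)
    return combined

def clean(records):
    """Find and correct simple errors: normalize item names, drop rows
    with a missing name or a non-positive amount."""
    cleaned = []
    for rec in records:
        name = rec.get("item", "").strip().title()
        amount = rec.get("amount")
        if name and isinstance(amount, (int, float)) and amount > 0:
            cleaned.append({"item": name, "amount": amount})
    return cleaned

# Two heterogeneous sources with inconsistent formatting:
crm_export = [{"item": " keyboard ", "amount": 120}]
erp_export = [{"item": "mouse", "amount": 45}, {"item": "", "amount": 10}]

staged = clean(extract([crm_export, erp_export]))
print(staged)  # [{'item': 'Keyboard', 'amount': 120}, {'item': 'Mouse', 'amount': 45}]
```

In a real warehouse these two steps would be followed by transformation and loading, as described in later chapters.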
CHAPTER 3
Terminologies
This chapter discusses some of the terms commonly used in data warehousing.
Data Warehouse
Non-Volatile - Non-volatile means that previous data is not removed when new data is added to it. The data warehouse is kept separate from the operational database; therefore frequent changes in the operational database are not reflected in the data warehouse.
Metadata - Metadata is simply defined as data about data; that is, data used to represent other data is known as metadata. For example, the index of a book serves as metadata for the contents of the book. In other words, we can say that metadata is summarized data that leads us to the detailed data.
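As a small illustration, metadata can be pictured as a record that describes a detailed data set, much as a book's index describes the book. The table name, column names, and values below are hypothetical:

```python
# Metadata: data that describes other data.
# The "sales" table and its columns are hypothetical, for illustration only.

sales_rows = [
    ("2009-Q1", "keyboard", "New Delhi", 620),
    ("2009-Q2", "mouse", "New Delhi", 450),
]

# Metadata describing the detailed data above: its structure, origin,
# and size, leading us to the detailed data without containing it.
sales_metadata = {
    "table": "sales",
    "columns": ["time", "item", "location", "units_sold"],
    "source": "operational sales database",
    "row_count": len(sales_rows),
}

print(sales_metadata["row_count"])   # 2
print(sales_metadata["columns"][3])  # units_sold
```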
In terms of a data warehouse, we can define metadata as follows:
Metadata Repository

The metadata repository is an integral part of a data warehouse system. It contains the following metadata:
Data Cube

A data cube helps us represent data in multiple dimensions. It is defined by dimensions and facts. The dimensions are the entities with respect to which an enterprise keeps its records.

The following table represents a 2-D view of the sales data of a company with respect to the time and item dimensions.
In this 2-D table we have records with respect to time and item only. The sales for New Delhi are shown with respect to the time and item dimensions, according to the type of items sold. Suppose we want to view the sales data with one more dimension, say the location dimension. The 3-D view of the sales data with respect to time, item, and location is shown in the table below.

The above 3-D table can be represented as a 3-D data cube, as shown in the following figure.
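The idea of viewing the same facts through different dimensions can be sketched as a simple aggregation over (time, item, location) records. The figures below are hypothetical, not the values from the tables above:

```python
from collections import defaultdict

# Facts: (time, item, location, units_sold). All figures are hypothetical.
facts = [
    ("Q1", "keyboard", "New Delhi", 605),
    ("Q1", "mouse",    "New Delhi", 825),
    ("Q1", "keyboard", "Chennai",   300),
    ("Q2", "keyboard", "New Delhi", 680),
]

def view(facts, *dims):
    """Aggregate the cube over the named dimensions,
    summing units over every dimension that is left out."""
    index = {"time": 0, "item": 1, "location": 2}
    out = defaultdict(int)
    for fact in facts:
        key = tuple(fact[index[d]] for d in dims)
        out[key] += fact[3]
    return dict(out)

# A 2-D view: time and item only, summing over all locations.
print(view(facts, "time", "item"))
# A 1-D view: totals per location.
print(view(facts, "location"))
```

The same `view` function produces the 2-D table (time and item) or any other face of the cube, which is exactly what the tables in this section illustrate.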
Data Mart

A data mart contains a subset of organization-wide data. This subset of data is valuable to a specific group within an organization. In other words, we can say that a data mart contains only the data specific to a particular group. For example, the marketing data mart may contain only data related to items, customers, and sales. Data marts are confined to subjects.

Points to remember about data marts:

- The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks rather than months or years.
- The life cycle of a data mart may become complex in the long run if its planning and design are not organization-wide.
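A data mart can be pictured as a subject-specific subset materialized from the organization-wide data. The records and subject tags below are hypothetical:

```python
# A data mart holds only the subset of organization-wide data that is
# relevant to one group. Records and subject tags are hypothetical.

enterprise_data = [
    {"subject": "sales",     "item": "keyboard",  "revenue": 120},
    {"subject": "marketing", "campaign": "spring", "leads": 40},
    {"subject": "finance",   "account": "AP",      "balance": 9000},
    {"subject": "marketing", "campaign": "summer", "leads": 55},
]

def build_data_mart(data, subject):
    """Materialize the subject-specific subset for one group."""
    return [row for row in data if row["subject"] == subject]

marketing_mart = build_data_mart(enterprise_data, "marketing")
print(len(marketing_mart))  # 2
```

Unlike the virtual warehouse described next, this subset is copied out (materialized), so it occupies its own storage but does not load the operational servers at query time.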
Figure: Graphical representation of a data mart.
Virtual Warehouse

The view over an operational data warehouse is known as a virtual warehouse. It is easy to build a virtual warehouse, but doing so requires excess capacity on the operational database servers.
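Since a virtual warehouse is just a view defined over the operational data, it can be sketched with an in-memory SQLite database. The table, view, and column names below are hypothetical; the point is that the view copies no data, so every query runs against the live operational table (hence the need for excess capacity on the operational servers):

```python
import sqlite3

# A hypothetical operational table of live order transactions.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (item TEXT, qty INTEGER, status TEXT)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    ("keyboard", 3, "shipped"),
    ("mouse",    5, "shipped"),
    ("keyboard", 2, "pending"),
])

# The "virtual warehouse": an analytical view that is cheap to build,
# but is re-evaluated against the operational table on every query.
con.execute("""
    CREATE VIEW sales_summary AS
    SELECT item, SUM(qty) AS total_qty
    FROM orders
    WHERE status = 'shipped'
    GROUP BY item
""")

print(con.execute("SELECT * FROM sales_summary ORDER BY item").fetchall())
# [('keyboard', 3), ('mouse', 5)]
```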
CHAPTER 4
Delivery Process

This chapter describes the data warehouse delivery process.
Introduction
There should be a delivery process to deliver a data warehouse. However, data warehouse projects suffer from so many issues that it is very difficult to complete the tasks and deliverables in the strict, ordered fashion demanded by the waterfall method, because the requirements are hardly ever fully understood. Only when the requirements are understood can the architecture, designs, and build components be completed.
Delivery Method
The delivery method is a variant of the joint application development approach,
adopted for delivery of data warehouse. We staged the data warehouse delivery
process to minimize the risk. The approach that i will discuss does not reduce the
overall delivery time-scales but ensures business benefits are delivered incrementally
through the development process.
Note: The delivery process is broken into phases to reduce project and delivery risk.
IT Strategy
Data warehouses are strategic investments that require a business process to generate
the projected benefits. An IT strategy is required to procure and retain funding for the
project.
Business Case
The objective of the business case is to identify the projected business benefits that
should be derived from using the data warehouse. These benefits may not be
quantifiable, but the projected benefits need to be clearly stated. If the data warehouse
does not have a clear business case, then the business tends to suffer from credibility
problems at some stage during the delivery process. Therefore, in a data warehouse
project, we need to understand the business case for investment.
The prototype can be thrown away after the feasibility concept has been
demonstrated.
The activity addresses a small subset of the eventual data content of the data
warehouse.
Limit the scope of the first build phase to the minimum that delivers business
benefits.
Understand the short-term and medium-term requirements of the data
warehouse.
Business Requirements
To provide quality deliverables, we should make sure that the overall requirements are
understood. The business requirements and technical blueprint stages are required for
the following reasons:
If we understand the business requirements for both the short and medium
term, then we can design a solution that satisfies the short-term need.
Technical Blueprint
This phase needs to deliver an overall architecture satisfying the long-term
requirements, along with the components that must be implemented in the short term
to derive any business benefit. The blueprint needs to identify the following:
The data retention policy.
History Load
This is the phase in which the remainder of the required history is loaded into the data
warehouse. In this phase we do not add new entities, but additional physical tables
would probably be created to store the increased data volumes.
For example, suppose the build phase has delivered a retail sales analysis data
warehouse with 2 months' worth of history. This information allows the user to analyse
only recent trends and address short-term issues; the user cannot identify annual or
seasonal trends. To allow yearly and seasonal trend analysis, 2 years' worth of sales
history could be loaded from the archive. In doing so, the 40 GB of data grows to
400 GB.
Ad hoc Query
In this phase we configure an ad hoc query tool.
Note: It is recommended not to use these access tools while the database is being
substantially modified.
Automation
In this phase, operational management processes are fully automated. These would
include:
Extending Scope
In this phase, the data warehouse is extended to address a new set of business
requirements. The scope can be extended in two ways:
Note: This phase should be performed separately, since it involves substantial effort
and complexity.
Requirements Evolution
From the perspective of the delivery process, the requirements are always changeable;
they are not static. The delivery process must support this and allow these changes to
be reflected in the system.
This issue is addressed by designing the data warehouse around the use of data within
business processes, as opposed to the data requirements of existing queries.
The architecture is designed to change and grow to match the business needs. The
process operates as a pseudo application development process in which new
requirements are continually fed into the development activities and partial
deliverables are produced. These partial deliverables are fed back to the users and
then reworked, ensuring that the overall system is continually updated to meet the
business needs.
CHAPTER 5
System Processes
This chapter gives a basic idea about data warehouse system processes. We will
focus on designing a data warehousing solution built on top of open-system
technologies like Unix and relational databases.
Extract and Load Process
Data extraction takes data from the source systems. The data load takes the extracted
data and loads it into the data warehouse.
Note: Before loading the data into the data warehouse, the information extracted from
the external sources must be reconstructed.
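As a minimal sketch (the source records, field names, and target structure below are all made up for illustration), the extract step pulls rows from a source system as-is, and the load step reconstructs them into the warehouse structure:

```python
# Hypothetical source rows, as extracted from an operational system.
source_rows = [
    {"id": 1, "amount": "10.50", "date": "2013-08-03"},
    {"id": 2, "amount": "7.25", "date": "2013-08-04"},
]

def extract(rows):
    """Extract: take data from the source system unchanged."""
    return list(rows)

def load(rows):
    """Load: reconstruct each record into the warehouse structure
    (here, renaming fields and converting the amount to a number)."""
    return [
        {"txn_id": r["id"], "amount": float(r["amount"]), "txn_date": r["date"]}
        for r in rows
    ]

warehouse = load(extract(source_rows))
print(warehouse[0]["amount"])  # 10.5
```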
LOADING THE DATA
After extraction, the data is loaded into a temporary data store, where it is cleaned up
and made consistent.
Note: Consistency checks are executed only when all the data sources have been
loaded into the temporary data store.
Clean and Transform
Make sure the data is consistent with other data within the same data source.
Transforming involves converting the source data into a structure. Structuring the data
increases query performance and decreases operational cost. The information in the
data warehouse must be transformed to support the performance requirements of the
business and the ongoing operational cost.
AGGREGATION
Aggregation is required to speed up common queries. Aggregation relies on the fact
that the most common queries will analyse a subset or an aggregation of the detailed
data.
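A small sketch of the idea (rows and region names are made up): the aggregate is built once at load time, so the common query reads a few precomputed totals instead of scanning every detailed row:

```python
# Detailed fact rows (made up). The common query "total units per region"
# can be answered from a much smaller precomputed aggregate.
detailed = [
    ("north", "2013-08-01", 5),
    ("north", "2013-08-02", 3),
    ("south", "2013-08-01", 7),
]

# Build the aggregate once, at load time.
aggregate = {}
for region, _day, units in detailed:
    aggregate[region] = aggregate.get(region, 0) + units

# Common queries now read the small aggregate instead of scanning detail.
print(aggregate["north"])  # 8
```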
Backup and Archive the Data
In order to recover the data in the event of data loss, software failure, or hardware
failure, it is necessary to back up the data on a regular basis. Archiving involves
removing the old data from the system in a format that allows it to be quickly restored
whenever required.
For example, in a retail sales analysis data warehouse, it may be required to keep data
for 3 years, with the latest 6 months of data kept online. In this kind of scenario there is
often a requirement to do month-on-month comparisons for this year and last year. In
that case we require some data to be restored from the archive.
This process should also ensure that all system sources are used in the most
effective way.
This process does not generally operate during the regular load of information into
the data warehouse.
CHAPTER 6
Architecture
This chapter gives a basic idea about data warehouse architecture. In this chapter we
will discuss the business analysis framework for data warehouse design, and the
architecture of a data warehouse.
Since the data warehouse can gather information quickly and efficiently, it can
enhance business productivity.
The data warehouse provides us with a consistent view of customers and items,
and hence helps us manage customer relationships.
The data warehouse also helps in bringing down costs by tracking trends and
patterns over a long period in a consistent and reliable manner.
To design an effective and efficient data warehouse, we need to understand and
analyze the business needs and construct a business analysis framework. Each
person has a different view regarding the design of a data warehouse. These views are
as follows:
The top-down view - This view allows the selection of the relevant information
needed for the data warehouse.
The data source view - This view presents the information being captured,
stored, and managed by the operational system.
The data warehouse view - This view includes the fact tables and dimension
tables. It represents the information stored inside the data warehouse.
The business query view - This is the view of the data from the viewpoint of the
end user.
Three-Tier Data Warehouse Architecture
Generally, data warehouses adopt a three-tier architecture. Following are the three
tiers of the data warehouse architecture:
Bottom tier - The bottom tier of the architecture is the data warehouse database
server. It is the relational database system. We use back-end tools and utilities
to feed data into the bottom tier. These back-end tools and utilities perform the
Extract, Clean, Load, and Refresh functions.
Middle tier - In the middle tier, we have the OLAP server. The OLAP server can
be implemented in either of the following ways:
Top tier - This tier is the front-end client layer. This layer holds the query tools,
reporting tools, analysis tools, and data mining tools.
Data Warehouse Models
From the perspective of data warehouse architecture, we have the following data
warehouse models:
Virtual warehouse
Data mart
Enterprise warehouse
VIRTUAL WAREHOUSE
A view over an operational data warehouse is known as a virtual warehouse. It
is easy to build a virtual warehouse.
DATA MART
A data mart contains a subset of organisation-wide data.
Note: In other words, a data mart contains only the data that is specific to a particular
group. For example, the marketing data mart may contain only data related to items,
customers, and sales. Data marts are confined to subjects.
Windows-based or Unix/Linux-based servers are used to implement data marts.
They are implemented on low-cost servers.
The implementation cycle of a data mart is measured in short periods of time, i.e.
in weeks rather than months or years.
The life cycle of a data mart may become complex in the long run if its planning
and design are not organisation-wide.
ENTERPRISE WAREHOUSE
The enterprise warehouse collects all the information about all the subjects
spanning the entire organization.
Load Manager
This component performs the operations required for the extract and load process.
The size and complexity of the load manager varies between specific solutions,
from one data warehouse to another.
It performs simple transformations into a structure similar to the one in the data
warehouse.
EXTRACT DATA FROM SOURCE
The data is extracted from the operational databases or from external information
providers. Gateways are the application programs that are used to extract data. A
gateway is supported by the underlying DBMS and allows a client program to generate
SQL to be executed at a server. Open Database Connectivity (ODBC) and Java
Database Connectivity (JDBC) are examples of gateways.
FAST LOAD
In order to minimize the total load window, the data needs to be loaded into the
warehouse in the fastest possible time.
It is more effective to load the data into a relational database prior to applying
transformations and checks.
SIMPLE TRANSFORMATIONS
While loading, it may be required to perform simple transformations. Once this has
been completed, we are in a position to do the complex checks. Suppose we are
loading EPOS sales transactions; we need to perform the following checks:
Strip out all the columns that are not required within the warehouse.
Warehouse Manager
The warehouse manager is responsible for the warehouse management process.
Backup/recovery tool
SQL scripts
Creates the indexes, business views, and partition views against the base data.
Generates new aggregations and updates the existing aggregations.
Generates the normalizations.
The warehouse manager archives the data that has reached the end of its
captured life.
Note: The warehouse manager also analyses query profiles to determine whether
indexes and aggregations are appropriate.
Query Manager
The query manager is responsible for directing the queries to the suitable tables.
By directing the queries to the appropriate tables, the query request and
response process is speeded up.
Stored procedures.
Detailed Information
The following diagram shows the detailed information.
The detailed information is not kept online; rather, it is aggregated to the next level of
detail and then archived to tape. The detailed information part of the data warehouse
keeps the detailed information in the starflake schema. The detailed information is
loaded into the data warehouse to supplement the aggregated data.
Note: If the detailed information is held offline to minimize disk storage, we should
make sure that the data has been extracted, cleaned up, and transformed into the
starflake schema before it is archived.
Summary Information
In this area of the data warehouse, the predefined aggregations are kept.
This area changes on an ongoing basis in order to respond to changing query
profiles.
It needs to be updated whenever new data is loaded into the data warehouse.
It may not be backed up, since it can be generated fresh from the detailed
information.
CHAPTER 7
OLAP
This chapter gives a basic idea about OLAP - OnLine Analytical Processing servers.
Introduction
An Online Analytical Processing (OLAP) server is based on the multidimensional data
model. It allows managers and analysts to gain insight into the information through
fast, consistent, interactive access to information. In this chapter we will discuss the
types of OLAP, the operations on OLAP, and the differences between OLAP and
statistical databases and OLTP.
Relational OLAP (ROLAP)
Additional tools and services.
OLAP Operations
Since the OLAP server is based on the multidimensional view of data, we will discuss
the OLAP operations on multidimensional data:
Roll-up
Drill-down
Slice and dice
Pivot (rotate)
ROLL-UP
This operation performs aggregation on a data cube in either of the following ways:
By climbing up a concept hierarchy for a dimension.
By dimension reduction.
The roll-up operation shown here is performed by climbing up a concept hierarchy
for the dimension location.
Initially the concept hierarchy was "street < city < province < country".
When a roll-up is performed by dimension reduction, one or more dimensions are
removed from the data cube.
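The hierarchy-climbing form of roll-up can be sketched as follows. The cities, countries, and figures are made up for illustration; only the location hierarchy from the text is assumed:

```python
# Roll-up on the location dimension by climbing the concept hierarchy
# city -> country. Cities, countries, and figures are illustrative.
city_sales = {"Vancouver": 100, "Victoria": 50, "Chicago": 440}
city_to_country = {"Vancouver": "Canada", "Victoria": "Canada", "Chicago": "USA"}

# Aggregate each city's measure up to its parent in the hierarchy.
country_sales = {}
for city, units in city_sales.items():
    country = city_to_country[city]
    country_sales[country] = country_sales.get(country, 0) + units

print(country_sales["Canada"])  # 150
```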
DRILL-DOWN
Drill-down is the reverse of roll-up. This operation is performed in either of the
following ways:
By stepping down a concept hierarchy for a dimension.
By introducing a new dimension.
The drill-down operation shown here is performed by stepping down a concept
hierarchy for the dimension time.
Initially the concept hierarchy was "day < month < quarter < year".
On drilling down, the time dimension is descended from the level of quarter to the
level of month.
When a drill-down is performed by introducing a new dimension, one or more
dimensions are added to the data cube.
It navigates from less detailed data to more detailed data.
SLICE
The slice operation performs a selection on one dimension of a given cube, resulting
in a new sub-cube. Consider the following diagram, which shows the slice operation.
The slice operation is performed for the dimension time using the criterion
time = "Q1".
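A slice can be sketched as fixing one dimension of the cube (time = "Q1") and keeping the remaining two. The cube contents below are made up for illustration:

```python
# Slice: select a single value on one dimension (time = "Q1"),
# producing a 2-D sub-cube over (item, location). Data is made up.
cube = {
    ("Q1", "mobile", "Delhi"): 605,
    ("Q2", "mobile", "Delhi"): 680,
    ("Q1", "modem", "Mumbai"): 825,
}

slice_q1 = {
    (item, loc): units
    for (time, item, loc), units in cube.items()
    if time == "Q1"
}
print(len(slice_q1))  # 2
```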
DICE
The dice operation performs a selection on two or more dimensions of a given cube,
resulting in a new sub-cube. Consider the following diagram showing the dice
operation:
The dice operation on the cube is based on selection criteria that involve three
dimensions.
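A dice differs from a slice in that it constrains two or more dimensions at once. The criteria and data below are made up for illustration:

```python
# Dice: select on two or more dimensions at once, e.g.
# time in {"Q1", "Q2"} and location = "Delhi". Data is made up.
cube = {
    ("Q1", "mobile", "Delhi"): 605,
    ("Q2", "mobile", "Delhi"): 680,
    ("Q3", "mobile", "Delhi"): 812,
    ("Q1", "modem", "Mumbai"): 825,
}

dice = {
    key: units
    for key, units in cube.items()
    if key[0] in {"Q1", "Q2"} and key[2] == "Delhi"
}
print(sorted(k[0] for k in dice))  # ['Q1', 'Q2']
```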
PIVOT
The pivot operation is also known as rotation. It rotates the data axes in view in order
to provide an alternative presentation of the data. Consider the following diagram
showing the pivot operation.
Here the item and location axes of the 2-D slice are rotated.
OLAP vs OLTP
SN  Data Warehouse (OLAP)                  Operational Database (OLTP)
3   This is used to analyze the business.  This is used to run the business.
CHAPTER 8
Relational OLAP
This chapter gives a basic idea about relational OLAP in data warehousing.
Introduction
Relational OLAP servers are placed between the relational back-end server and the
client front-end tools. To store and manage warehouse data, relational OLAP uses a
relational or extended-relational DBMS.
Points to remember
ROLAP tools need to analyze large volumes of data across multiple
dimensions.
ROLAP tools need to store and analyze highly volatile and changeable
data.
Database Server
ROLAP Server
Front-end tool
Advantages
ROLAP servers are highly scalable.
Disadvantages
Poor query performance.
CHAPTER 9
Multidimensional OLAP
This chapter gives a basic idea about multidimensional OLAP.
Introduction
Points to remember:
MOLAP tools process information with consistent response times regardless
of the level of summarization or the calculations selected.
The MOLAP server adopts two levels of storage representation to handle
dense and sparse data sets.
MOLAP Architecture
MOLAP includes the following components:
Database server
MOLAP server
Front-end tool
Advantages
Here is a list of advantages of multidimensional OLAP:
It helps users who are connected to a network and need to analyze larger, less
defined data.
Disadvantages
MOLAP is not capable of containing detailed data.
MOLAP vs ROLAP
SN  MOLAP  ROLAP
CHAPTER 10
Schema
This chapter gives a basic idea about schemas.
Introduction
A schema is a logical description of the entire database. It includes the name and
description of records of all record types, including all associated data items and
aggregates. Like a database, a data warehouse also requires a schema. A database
uses the relational model, while a data warehouse uses the star, snowflake, and fact
constellation schemas. In this chapter we will discuss the schemas used in a data
warehouse.
Star Schema
In a star schema, each dimension is represented with only one dimension table.
In the following diagram we have shown the sales data of a company with
respect to four dimensions, namely time, item, branch, and location.
There is a fact table at the centre. This fact table contains the keys to each of the
four dimensions.
The fact table also contains the measures, namely dollars sold and units sold.
Note: Each dimension has only one dimension table, and each table holds a set of
attributes. For example, the location dimension table contains the attribute set
{location_key, street, city, province_or_state, country}. This constraint may cause data
redundancy. For example, "Vancouver" and "Victoria" are both cities in the Canadian
province of British Columbia. The entries for such cities may cause data redundancy
along the attributes province_or_state and country.
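A minimal star schema can be sketched in SQLite, with one fact table keyed to a dimension table per dimension. The column names follow the text; the table layout and all rows are illustrative assumptions, and only two of the four dimensions are shown to keep the sketch short:

```python
# A minimal star schema in SQLite: a central fact table joined to one
# dimension table per dimension. Rows are made up for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT,
                       city TEXT, province_or_state TEXT, country TEXT);
CREATE TABLE time (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE sales (time_key INTEGER, location_key INTEGER,
                    dollars_sold REAL, units_sold INTEGER);
""")
con.execute("INSERT INTO location VALUES "
            "(1, 'Main St', 'Vancouver', 'British Columbia', 'Canada')")
con.execute("INSERT INTO time VALUES (1, 'Q1', 2013)")
con.execute("INSERT INTO sales VALUES (1, 1, 1500.0, 30)")

# A typical star join: the fact table joined to its dimension tables.
row = con.execute("""
    SELECT l.city, t.quarter, s.units_sold
    FROM sales s
    JOIN location l ON s.location_key = l.location_key
    JOIN time t ON s.time_key = t.time_key
""").fetchone()
print(row)  # ('Vancouver', 'Q1', 30)
```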
Snowflake Schema
In a snowflake schema, some dimension tables are normalized. For example, the item
dimension table of the star schema can be normalized by splitting out the supplier
data.
The item dimension table then contains the attributes item_key, item_name,
type, brand, and supplier_key.
The supplier_key is linked to the supplier dimension table. The supplier dimension
table contains the attributes supplier_key and supplier_type.
Fact Constellation Schema
In a fact constellation, there are multiple fact tables. This schema is also known as a
galaxy schema.
In the following diagram we have two fact tables, namely sales and shipping.
The shipping fact table has five dimensions, namely item_key, time_key,
shipper_key, from_location, and to_location.
The shipping fact table also contains two measures, namely dollars sold and units
sold.
It is also possible for dimension tables to be shared between fact tables. For
example, the time, item, and location dimension tables are shared between the
sales and shipping fact tables.
Schema Definition
The multidimensional schema is defined using Data Mining Query Language (DMQL).
The two primitives, cube definition and dimension definition, can be used to define
data warehouses and data marts.
define dimension item as (item key, item name, brand, type,
    supplier (supplier key, supplier type))
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street,
    city (city key, city, province or state, country))
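The dimension definitions above would normally accompany a cube definition, the other DMQL primitive. A sketch in the same style (the cube and measure names here are assumptions following the common textbook form, not taken from this document) might look like:

```
define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), units_sold = count(*)
```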
CHAPTER 11
Partitioning Strategy
This chapter gives a basic idea about partitioning strategy.
Introduction
Why Partition?
Here is a list of reasons:
To assist backup/recovery
To enhance performance
TO ASSIST BACKUP/RECOVERY
If we do not partition the fact table, then we have to load the complete fact table with
all the data. Partitioning allows us to load only the data that is required on a regular
basis. This reduces the time to load and also enhances the performance of the
system.
Note: To cut down the backup size, all partitions other than the current partition can
be marked read-only. We can then put these partitions into a state where they cannot
be modified, and then they can be backed up. This means that only the current
partition needs to be backed up.
TO ENHANCE PERFORMANCE
By partitioning the fact table into sets of data, the query procedures can be enhanced.
Query performance is enhanced because the query now scans only the partitions that
are relevant; it does not have to scan a large amount of irrelevant data.
Horizontal Partitioning
There are various ways in which a fact table can be partitioned. In horizontal
partitioning, we have to keep in mind the requirements for manageability of the data
warehouse.
The number of physical tables is kept relatively small, which reduces the
operating cost.
This technique is suitable where a mix of dipping into recent history and mining
through the entire history is required.
Suppose a marketing function is structured into distinct regional departments, for
example on a state-by-state basis. If each region wants to query the information
captured within its own region, it proves more effective to partition the fact table into
regional partitions. This speeds up the queries because they do not need to scan
information that is not relevant.
Since the query does not have to scan the irrelevant data, the query process is
speeded up.
If the dimensions change, then the entire fact table would have to be
repartitioned.
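The regional split above can be sketched as follows. The fact rows and region names are made up for illustration; the point is only that a regional query touches one partition instead of the whole table:

```python
# Horizontal partitioning by region, sketched as one list of fact rows
# split into per-region partitions. Regions and rows are made up.
facts = [
    {"region": "east", "units": 5},
    {"region": "west", "units": 7},
    {"region": "east", "units": 2},
]

partitions = {}
for row in facts:
    partitions.setdefault(row["region"], []).append(row)

# A regional query now scans only its own partition.
east_total = sum(r["units"] for r in partitions["east"])
print(east_total)  # 7
```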
Note: This partitioning requires metadata to identify what data is stored in each
partition.
PARTITIONING DIMENSIONS
If a dimension contains a large number of entries, then it is required to partition the
dimensions. Here we have to check the size of the dimension.
Consider a large design that changes over time. If we need to store all the variations
in order to apply comparisons, the dimension may become very large. This would
definitely affect the response time.
Vertical Partitioning
In vertical partitioning, the data is split vertically.
Vertical partitioning can be performed in the following two ways:
Normalization
Row splitting
NORMALIZATION
Normalization is the standard relational method of database organization. In this
method, redundant rows are collapsed into a single row, hence reducing the space.
For example, the table below can be normalized into a sales table and a store table:
Store table:
Store_id  Store_name  Location   Region
16        sunny       Bangalore  W
64        san         Mumbai     S

Sales table:
Product_id  Quantity  Value  sales_date  Store_id
30          5         3.67   3-Aug-13    16
35          4         5.33   3-Sep-13    16
40          5         2.50   3-Sep-13    64
45          7         5.66   3-Sep-13    16
ROW SPLITTING
Row splitting tends to leave a one-to-one map between the partitions. The motive of
row splitting is to speed up access to a large table by reducing its size.
Note: While using vertical partitioning, make sure that there is no requirement to
perform major join operations between the two partitions.
Account_Txn_Table
transaction_id
account_id
transaction_type
value
transaction_date
region
branch_name
We can choose to partition on any key. Two possible keys are:
region
transaction_date
Suppose the business is organised into 30 geographical regions, and each region
has a different number of branches. That gives us 30 partitions, which is reasonable.
This partitioning is good enough because our requirements capture has shown that
the vast majority of queries are restricted to the user's own business region.
Now suppose we partition by transaction_date instead of region. In that case the
latest transactions from every region will be in one partition, and a user who wants to
look at data within his own region has to query across multiple partitions.
CHAPTER 12
Metadata Concepts
This chapter gives a basic idea about metadata and its concepts.
What is Metadata
Metadata is simply defined as data about data. The data that is used to represent
other data is known as metadata. For example, the index of a book serves as metadata
for the contents of the book. In other words, we can say that metadata is the
summarized data that leads us to the detailed data. In terms of a data warehouse, we
can define metadata as follows.
Note: In a data warehouse, we create metadata for the data names and definitions of a
given data warehouse. Along with this metadata, additional metadata is also created
for timestamping any extracted data and recording the source of the extracted data.
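As a sketch, a single metadata record might describe one extracted table, including the extraction timestamp and the source as the note suggests. Every field name and value below is a made-up illustration:

```python
# Metadata as data about data: a made-up record describing one
# extracted table, with a timestamp and the source system.
metadata = {
    "table": "sales",
    "columns": {"units_sold": "INTEGER", "dollars_sold": "REAL"},
    "extracted_at": "2013-09-03T02:00:00",
    "source": "epos_system",
}

# Tools read the metadata, not the data, to answer "what is in here?"
print(sorted(metadata["columns"]))  # ['dollars_sold', 'units_sold']
```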
Categories of Metadata
The metadata can be broadly categorized into three categories:
Role of Metadata
Metadata has a very important role in a data warehouse. The role of metadata in the
warehouse is different from that of the warehouse data, yet it is very important. The
various roles of metadata are explained below.
Metadata acts as a directory. This directory helps the decision support system to
locate the contents of the data warehouse.
Metadata helps the decision support system in the mapping of data when data
is transformed from the operational environment to the data warehouse
environment.
Metadata Repository
The metadata repository is an integral part of a data warehouse system. The metadata
repository contains the following metadata:
Challenges for Metadata Management
The importance of metadata cannot be overstated. Metadata helps in driving the
accuracy of reports, validates data transformations, and ensures the accuracy of
calculations. Metadata also enforces the consistent definition of business terms among
business end users. With all these uses, metadata also poses challenges for metadata
management. Some of the challenges are discussed below.
The metadata in a big organization is scattered across the organization. This metadata
is spread across spreadsheets, databases, and applications.
The metadata could be present in text files or multimedia files. To use this data for
information management solutions, it needs to be correctly defined.
There are no industry-wide accepted standards. Data management solution vendors
have a narrow focus.
CHAPTER 13
Data Marting
This chapter gives a basic idea about data marting.
Note: Do not data mart for any other reason, since the operational cost of data marting
could be very high. Before data marting, make sure that the data marting strategy is
appropriate for your particular solution.
IDENTIFY THE FUNCTIONAL SPLITS
In this step, we determine whether the organization has natural functional splits. We
look for departmental splits, and we determine whether the way in which departments
use information tends to be in isolation from the rest of the organization. Let's have an
example.
As the merchant is not interested in the products they are not dealing with, the data
mart is a subset of the data dealing with the product group of interest. The following
diagram shows data marting for different users.
The merchant could query the sales trends of other products to analyse what is
happening to the sales.
These are issues that need to be taken into account while determining the functional
split.
Note: We need to determine the business benefits and the technical feasibility of using
a data mart.
Some tools can be populated directly from the source system, but some cannot.
Therefore, additional requirements outside the scope of the tools need to be identified
for the future.
Note: In order to ensure consistency of data across all access tools, the data should
not be populated directly from the data warehouse; rather, each tool must have its own
data mart.
Data marts allow us to build a complete wall by physically separating data segments
within the data warehouse. To avoid possible privacy problems, the detailed data can
be removed from the data warehouse. We can create a data mart for each legal entity
and load it via the data warehouse, with detailed account data.
The summaries are data marted in the same way as they would have been designed
within the data warehouse. Summary tables help to utilize all the dimension data in the
starflake schema.
Network Access
Note: Data marting is more expensive than aggregation; therefore it should be used as
an additional strategy, not as an alternative strategy.
NETWORK ACCESS
A data mart could be in a different location from the data warehouse, so we should
ensure that the LAN or WAN has the capacity to handle the data volumes being
transferred within the data mart load process.
Network capacity.
CHAPTER 15
System Managers
This chapter gives a basic idea about data warehouse system managers.
Introduction
System Scheduling Manager
The system scheduling manager is responsible for the successful implementation of
the data warehouse. The purpose of this scheduling manager is to schedule ad hoc
queries. Every operating system has its own scheduler with some form of batch
control mechanism. The features of the system scheduling manager are as follows.
Note: The above are the parameters for evaluating a good scheduler.
Some important jobs that a scheduler must be able to handle are as follows:
Data load
Data processing
Index creation
Backup
Aggregation creation
Data transformation
Note: If the data warehouse is running on a cluster or MPP architecture, then the
system scheduling manager must be capable of running across that architecture.
Note: The event manager monitors event occurrences and deals with them. The
event manager also tracks the myriad of things that can go wrong in a complex data
warehouse system.
EVENTS
What is an event? An event is nothing but an action generated by the user or by the
system itself. It may be noted that an event is a measurable, observable occurrence of
a defined action.
The following are the common events that need to be tracked:
Hardware failure.
A process dying.
The most important thing about events is that they should be capable of executing on
their own. Event packages define the procedures for the predefined events. The code
associated with each event is known as an event handler. This code is executed
whenever an event occurs.
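The handler idea can be sketched as a dispatch table mapping each predefined event to its self-contained handler code. The event names and actions below are made up for illustration:

```python
# Events mapped to event handlers: each handler is self-contained code
# executed when its event occurs. Names and actions are made up.
log = []

def on_hardware_failure(event):
    log.append(f"page operator: {event}")

def on_process_died(event):
    log.append(f"restart process: {event}")

handlers = {
    "hardware_failure": on_hardware_failure,
    "process_died": on_process_died,
}

def event_manager(event_type, event):
    """Dispatch an occurred event to its predefined handler."""
    handlers[event_type](event)

event_manager("process_died", "load_manager")
print(log[0])  # restart process: load_manager
```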
System and Database Manager
The system manager and the database manager are two separate pieces of software,
but they do the same job. The objective of these tools is to automate certain
processes and to simplify the execution of others. The criteria for choosing a system
and database manager are the ability to:
with the schedule manager software being used. The important features that are
required for the management of backups are following.
Scheduling
Database awareness.
Backups are taken only to protect the data against loss. The following are
important points to remember:
The backup software will keep some form of database of where and when
each piece of data was backed up.
The backup recovery manager must have a good front end to that
database.
CHAPTER
16
Process Managers
This chapter gives a basic idea about Data Warehouse Process Managers.
The size and complexity of the load manager varies between specific solutions,
from one data warehouse to another.
The load manager performs simple transformations of the data into a structure
similar to the one in the data warehouse.
EXTRACT DATA FROM SOURCE
The data is extracted from the operational databases or from external information
providers. Gateways are the application programs used to extract data. A gateway is
supported by the underlying DBMS and allows a client program to generate SQL to be
executed at a server. Open Database Connectivity (ODBC) and Java Database
Connectivity (JDBC) are examples of gateways.
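The gateway pattern can be sketched with Python's DB-API, using the standard-library sqlite3 driver as a stand-in for the operational source; a production extract would go through an ODBC or JDBC driver against the real operational database, and the table and column names here are invented:

```python
# Sketch of the gateway pattern: the client program generates SQL and the
# server executes it. sqlite3 stands in for a real operational database;
# the transactions table and its columns are invented for illustration.
import sqlite3

def extract(conn, since):
    """Yield source rows changed since the given date."""
    cur = conn.execute(
        "SELECT account_id, amount, txn_date FROM transactions "
        "WHERE txn_date >= ?", (since,))
    yield from cur

# Build a tiny in-memory "operational database" to run the extract against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (account_id, amount, txn_date)")
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)",
                 [(1, 9.99, "2024-01-01"), (2, 5.00, "2024-02-01")])

rows = list(extract(conn, "2024-01-15"))   # only rows changed since that date
```

The point of the sketch is the division of labour: the client composes the SQL, while execution happens on the server side, behind the driver.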
FAST LOAD
In order to minimize the total load window, the data needs to be loaded into
the warehouse in the fastest possible time.
SIMPLE TRANSFORMATIONS
While loading, it may be required to perform simple transformations. After these have
been completed, we are in a position to do the complex checks. Suppose we are loading
EPOS sales transactions; we need to perform the following checks:
Strip out all the columns that are not required within the warehouse.
Convert all the values to the required data types.
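The two checks above can be sketched as a record-level transformation; the EPOS record layout here is an invented example:

```python
# Sketch of the two simple load-time transformations: strip unwanted columns
# and convert values to required types. The record layout is invented.
REQUIRED = {"store_id": int, "amount": float, "sold_at": str}

def transform(record):
    """Keep only warehouse columns and cast each to its required type."""
    return {col: cast(record[col]) for col, cast in REQUIRED.items()}

raw = {"store_id": "42", "amount": "19.95", "sold_at": "2024-03-01",
       "till_operator": "J. Smith"}   # till_operator is not needed downstream
clean = transform(raw)
```

Anything not listed in the required-column map is silently dropped, and every kept value arrives in the warehouse with a known type.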
Warehouse Manager
The warehouse manager is responsible for the warehouse management
process.
Backup/Recovery tool
SQL Scripts
OPERATIONS PERFORMED BY
WAREHOUSE MANAGER
The warehouse manager analyzes the data to perform consistency and
referential integrity checks.
It creates indexes, business views, and partition views against the base
data.
It archives the data that has reached the end of its
captured life.
Note: The warehouse manager also analyzes query profiles to determine whether
indexes and aggregations are appropriate.
Query Manager
The query manager is responsible for directing queries to the suitable
tables.
Stored procedures.
OPERATIONS PERFORMED BY QUERY
MANAGER
The query manager directs queries to the appropriate tables.
It schedules the execution of the queries posed by the end
user.
CHAPTER
17
Security
This chapter gives a basic idea about Data Warehouse Security.
Introduction
The data from each analyst can be summarized and passed on to management, where
the different summaries can be created. As the aggregation of summaries cannot be the
same as the aggregation of the whole, it is possible to miss some information
trends in the data unless someone is analyzing the data as a whole.
Requirements
Adding security will affect the performance of the data warehouse, therefore it is
worth determining the security requirements as early as possible. Adding security after
the data warehouse has gone live is very difficult.
During the design phase of the data warehouse, we should keep in mind what data
sources may be added later and what the impact of adding those data sources would
be. We should consider the following possibilities during the design phase:
Will the new data sources require new security and/or audit
restrictions to be implemented?
Will new users be added who have restricted access to data that is
already generally available?
This situation arises when the future users and the data sources are not well known. In
such a situation, we need to use our knowledge of the business and the objectives of
the data warehouse to determine the likely requirements.
Factors to Consider for Security Requirements
The following are the parts that are affected by security; hence it is worth considering
these factors.
User Access
Data Load
Data Movement
Query Generation
USER ACCESS
We need to classify the data first, and then classify the users by what data they can
access. In other words, the users are classified according to the data they can access.
Data Classification
The following are two approaches that can be used to classify the data:
The data can be classified according to its sensitivity. Highly sensitive
data is classified as highly restricted, and less sensitive data is classified as
less restrictive.
The data can also be classified according to job function. This
restriction allows only specific users to view particular data: we
restrict the users to viewing only the data in which they are interested and
for which they are responsible.
There are some issues with the second approach. To understand them, let's take an
example. Suppose you are building the data warehouse for a bank, and suppose further
that the data being stored in the warehouse is the transaction data for all the accounts.
The question here is who is allowed to see the transaction data. The solution lies in
classifying the data according to function.
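Classification by job function can be sketched as a mapping from roles to the data classes each role may view; the bank roles and data classes here are illustrative assumptions:

```python
# Sketch of classification by job function: each role grants access to a set
# of data classes. The roles and class names are illustrative assumptions.
ROLE_ACCESS = {
    "teller":      {"branch_transactions"},
    "branch_head": {"branch_transactions", "branch_summaries"},
    "auditor":     {"branch_transactions", "branch_summaries", "all_accounts"},
}

def can_view(role, data_class):
    """True if the role's job function covers the requested data class."""
    return data_class in ROLE_ACCESS.get(role, set())
```

Under this sketch a teller sees only branch transactions, while an auditor's function legitimately covers all account data, which answers the "who is allowed to see the transaction data" question by function rather than by individual.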
User classification
The following are approaches that can be used to classify the users:
The users can be classified according to their role, with people
grouped across departments based on their role.
Classification on basis of Department
Let's take the example of a data warehouse where the users are from the sales and
marketing departments. We can design the security with a top-down company view, with
access centered around the different departments, but there could be some restrictions
on users at different levels. This structure is shown in the following diagram.
If each department accesses different data, then we should design the security
access for each department separately. This can be achieved by departmental
data marts. Since these data marts are separated from the data warehouse, we
can enforce separate security restrictions on each data mart. This approach is
shown in the following figure.
Classification on basis of Role
If the data is generally available to all the departments, then it is worth following a role
access hierarchy. In other words, if the data is generally accessed by all the
departments, apply the security restrictions as per the role of the user. The role
access hierarchy is shown in the following figure.
AUDIT REQUIREMENTS
Auditing is a subset of security. It is a costly activity, therefore it is worth
understanding the audit requirements and the reason for each audit requirement.
Auditing can place heavy overheads on the system; to complete auditing in time,
more hardware is required, therefore it is recommended that, where possible,
auditing should be switched off. Audit requirements can be categorized into the following:
Connections
Disconnections
Data access
Data change
Note: For each of the above-mentioned categories it is necessary to audit success,
failure, or both. From the security perspective, the auditing of failures is
very important, because failures can highlight unauthorized or fraudulent access.
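Auditing of the above categories, recording both success and failure, can be sketched as follows; the record layout is an assumption, not any particular RDBMS's audit format:

```python
# Sketch of auditing the four categories above, recording both success and
# failure. The record layout is an assumption, not a real RDBMS audit format.
import datetime

audit_log = []

def audit(category, user, success):
    """Append one audit record; in practice this goes to a secured store."""
    audit_log.append({
        "when": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "category": category,   # connection / disconnection / access / change
        "user": user,
        "outcome": "success" if success else "failure",
    })

audit("connection", "analyst1", True)
audit("data access", "analyst1", False)   # failures can flag fraudulent access

failures = [r for r in audit_log if r["outcome"] == "failure"]
```

Keeping failures queryable on their own, as in the last line, is what makes the security review of unauthorized access attempts cheap.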
NETWORK REQUIREMENTS
Network security is as important as the other kinds of security; we cannot ignore the
network security requirements. We need to consider the following issues.
Are there restrictions on which network routes the data can take?
DATA MOVEMENT
There exist potential security implications while moving the data. Suppose we need to
transfer some restricted data as a flat file to be loaded. When the data is loaded into
the data warehouse, the following questions are raised:
If we talk about the backup of these flat files, the following questions are raised:
Who has access to these tapes?
Some other forms of data movement, such as query result sets, also need to be
considered. The questions raised when creating temporary
tables are as follows:
We should avoid the accidental flouting of security restrictions. If a user with access to
restricted data can generate generally accessible temporary tables, data can be made visible
to non-authorized users. We can overcome this by having a separate temporary area for
users with access to restricted data.
Documentation
The audit and security requirements need to be properly documented; this will be
treated as part of the justification. The document can contain all the information gathered
on the following:
Data classification
User classification
Network requirements
Application development
Database design
Testing
APPLICATION DEVELOPMENT
Security affects the overall application development, and it also affects the design of
important components of the data warehouse such as the load manager, warehouse
manager, and query manager. The load manager may require checking code to
filter records and place them in different locations. More transformation rules may
also be required to hide certain data. There may also be a requirement for extra metadata
to handle any extra objects.
To create and maintain the extra views, the warehouse manager may require extra code
to enforce security. Extra checks may need to be coded into
the data warehouse to prevent it from being fooled into moving data into a location where
it should not be available. The query manager requires changes to handle any
access restrictions, and it will need to be aware of all the extra views and
aggregations.
DATABASE DESIGN
The database layout is also affected, because when security is added there is an
increase in the number of views and tables. Adding security adds size to the database
and hence increases the complexity of the database design and management. It will
also add complexity to the backup management and recovery plan.
TESTING
Testing a data warehouse is a complex and lengthy process. Adding
security to the data warehouse also affects the testing time complexity. It affects the
testing in the following two ways:
It will increase the time required for integration and system testing.
CHAPTER
18
Backup
This chapter gives a basic idea about Data Warehouse backup.
Introduction
There exist large volumes of data in the data warehouse, and the data warehouse
system is very complex, hence it is important to have a backup of all the data
so that it is available for recovery in future as per the requirement. In this chapter I will
discuss the issues in designing a backup strategy.
Backup Terminologies
Before proceeding further, we should know some of the backup terminologies
discussed below.
Cold backup - A cold backup is taken while the database is completely shut
down. In a multi-instance environment, all the instances should be shut down.
Hot backup - A hot backup is taken while the database engine is up and
running. The hot backup requirements that need to be considered vary from
RDBMS to RDBMS. Hot backups are extremely useful.
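As a concrete illustration of a hot backup, SQLite's online backup API copies a database while the engine stays up and the source connection remains open. Every RDBMS has its own equivalent mechanism, so this is only a sketch of the idea:

```python
# Sketch of a hot backup: SQLite's online backup API copies the database
# while the engine is running. The table contents are invented.
import sqlite3

live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
live.execute("INSERT INTO sales VALUES (1, 9.99)")
live.commit()

backup = sqlite3.connect(":memory:")   # would be a file in practice
live.backup(backup)                    # taken while 'live' stays open: hot

copied = backup.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

A cold backup of the same database would instead require closing `live` first and copying the file while nothing is connected.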
Hardware Backup
It is important to decide which hardware to use for the backup. The hardware places an
upper bound on the speed at which a backup can be processed. The speed of
processing a backup or restore depends not only on the hardware being used, but
also on how the hardware is connected, the bandwidth of the network, the backup
software, and the speed of the server's I/O system. Here I will discuss some of the
hardware choices that are available, along with their pros and cons. These choices are as
follows:
Tape Technology
Disk Backups
TAPE TECHNOLOGY
The tape choice can be categorized into the following.
Tape media
Tape stackers
Tape silos
Tape Media
There exist several varieties of tape media. Some tape media standards are listed in
the table below:

Tape Media   Capacity   I/O Rate
DLT          40 GB      3 MB/s
8 mm         14 GB      1 MB/s
Scalability.
Shelf life of the tape medium.
Availability as network-available devices.
Suppose the server is a 48-node MPP machine: to which node do you
connect the tape drives, and how do you spread them over the server nodes to
get the optimal performance with the least disruption of the server and the least
internal I/O latency?
Connecting the tape drives remotely also requires high bandwidth.
TAPE STACKERS
A tape stacker is a device for loading multiple tapes into a single tape drive.
The stacker dismounts the current tape when it has finished with it and loads
the next tape, hence only one tape is accessible at a time. Prices
and capabilities may vary, but the common ability is that they can perform
unattended backups.
TAPE SILOS
Tape silos provide large storage capacities: a tape silo can store and manage
thousands of tapes, and can integrate multiple tape drives. Silos have
the software and hardware to label and store the tapes they hold. It is very common
for a silo to be connected remotely over a network or a dedicated link; we should
ensure that the bandwidth of that connection is up to the job.
Other Technologies
Technologies other than tape are mentioned below:
Disk Backups
Optical jukeboxes
DISK BACKUPS
Methods of disk backup are listed below:
Disk-to-disk backups
Mirror breaking
These methods are used in OLTP systems. They minimize the database
downtime and maximize the availability.
Disk-to-disk backups
In this kind of backup, the backup is taken onto disk rather than onto tape. Reasons for
doing disk-to-disk backups are:
Speed of restore
Backing up data from disk to disk is much faster than backing up to tape. However, this
is an intermediate step; the data is later backed up onto tape. The other
advantage of disk-to-disk backups is that they give you an online copy of the latest
backup.
Mirror Breaking
The idea is to have the disks mirrored for resilience during the working day. When a
backup is required, one of the mirror sets can be broken out. This technique is a variant
of disk-to-disk backup.
Note: The database may need to be shutdown to guarantee the consistency of the
backup.
OPTICAL JUKEBOXES
Optical jukeboxes allow data to be stored near-line. This technique allows a large
number of optical disks to be managed in the same way as a tape stacker or tape silo. The
drawback of this technique is that its write speed is slower than that of disks, but the
long life and reliability of optical media make them a good choice of medium for
archiving.
Software Backups
There are software tools available that help in the backup process. These tools
come as packages; they not only take backups, they also effectively
manage and control the backup strategies. There are many software packages
available in the market; some of them are listed in the following table.
Package Name Vendor
Networker Legato
ADSM IBM
Omniback II HP
Alexandria Sequent
Does the package have a client-server option, or must it run on the database
server itself?
CHAPTER
19
Tuning
This chapter gives a basic idea about Data Warehouse Tuning.
Introduction
A data warehouse never remains constant throughout its life:
It is very difficult to predict what queries the users are going to pose in
future.
The users and their profiles never remain the same over time.
Performance Assessment
Here is the list of objective measures of performance.
Scan rates.
It is of no use trying to tune response times if they are already better than
those required.
To hide the complexity of the system from the user, aggregations and
views should be used.
It is also possible that a user will write a query you had not tuned for.
Note: If there is a delay in transferring the data, or in the arrival of the data, then the
entire system is badly affected. Therefore it is very important to tune the data load first.
There are various approaches to tuning the data load, discussed below:
The most common approach is to insert data using the SQL layer. In this
approach, the normal checks and constraints need to be performed. When
the data is inserted into the table, code runs to check whether there is enough
space available to insert the data; if sufficient space is not available,
more space may have to be allocated to the tables. These checks
take time to perform and are costly in CPU terms, but they pack the data
tightly by making maximal use of the space.
The second approach is to bypass all these checks and constraints and
place the data directly into preformatted blocks. These blocks are later
written to the database. This is faster than the first approach, but it can work
only with whole blocks of data, which can lead to some space wastage.
The third approach is that, while loading data into a table that already
contains data, we maintain the indexes.
The fourth approach says that, to load data into tables that already
contain data, we drop the indexes and recreate them when the data load
is complete. Which of the third and fourth approaches is better depends on
how much data is already loaded and how many indexes need to be
rebuilt.
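The fourth approach can be sketched with SQLite: drop the index, bulk-load the rows, then recreate the index once the load completes. The table and index names are invented for illustration:

```python
# Sketch of the drop-and-recreate-indexes load approach, using SQLite.
# The fact table and index names are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (store_id INTEGER, amount REAL)")
conn.execute("CREATE INDEX idx_store ON fact_sales (store_id)")

def bulk_load(conn, rows):
    conn.execute("DROP INDEX idx_store")      # no per-row index maintenance
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)
    conn.execute("CREATE INDEX idx_store ON fact_sales (store_id)")
    conn.commit()                             # one index rebuild at the end

bulk_load(conn, [(s, s * 1.5) for s in range(1000)])
loaded = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
```

The trade-off named in the text is visible here: the insert loop avoids index upkeep entirely, but the whole index must be rebuilt afterwards, so the win shrinks as the amount of pre-existing data grows.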
Integrity Checks
Integrity checking highly affects the performance of the load.
Tuning Queries
We have two kinds of queries in data warehouse:
Fixed Queries
Ad hoc Queries
FIXED QUERIES
Fixed queries are well defined. The following are examples of fixed queries:
Regular reports
Canned queries
Common aggregations
Note: We cannot do much on the fact table, but while dealing with dimension tables or
aggregations, the usual collection of SQL tweaking, storage mechanisms, and
access methods can be used to tune these queries.
AD HOC QUERIES
To understand ad hoc queries, it is important to know the ad hoc users of the data
warehouse. Here is a list of points that need to be understood about the users of the
data warehouse:
It is important to track user profiles and identify the queries that are
run on a regular basis.
Identify the similar ad hoc queries that are frequently run.
If these queries are identified, then the database can be changed and new
indexes can be added for those queries.
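Tracking ad hoc usage can be sketched as a frequency count over normalized query text per user, so that regularly run queries stand out as candidates for new indexes; all names here are illustrative:

```python
# Sketch of tracking ad hoc query usage: count how often each (user, query)
# shape is run so that frequent queries can be given supporting indexes.
from collections import Counter

query_log = Counter()

def record(user, query):
    """Normalize whitespace and case so equivalent queries count together."""
    query_log[(user, " ".join(query.split()).lower())] += 1

record("analyst1", "SELECT region, SUM(amount) FROM sales GROUP BY region")
record("analyst1", "select region,  sum(amount) from sales group by region")
record("analyst2", "SELECT * FROM customers")

def frequent(threshold=2):
    """Queries run at least `threshold` times: index candidates."""
    return [q for (_, q), n in query_log.items() if n >= threshold]
```

The whitespace-and-case normalization is deliberately crude; a real profile tracker would also strip literal values so that the same query shape with different parameters counts as one pattern.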
CHAPTER
20
Testing
This chapter gives a basic idea about Data Warehouse Testing.
Introduction
Testing is very important for data warehouse systems to make them work correctly and
efficiently. There are three basic levels of testing, listed below:
Unit Testing
Integration Testing
System testing
UNIT TESTING
In unit testing, each component is tested separately.
In this kind of testing, each module, i.e. procedure, program, SQL script, or
Unix shell script, is tested individually.
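Unit testing of a single module can be sketched with Python's unittest; the transformation function under test is invented for illustration:

```python
# Sketch of unit testing one warehouse module in isolation. The function
# under test (a tiny monetary conversion) is invented for illustration.
import unittest

def to_pence(amount_pounds):
    """Module under test: convert a monetary amount in pounds to pence."""
    return round(amount_pounds * 100)

class TestToPence(unittest.TestCase):
    def test_whole_pounds(self):
        self.assertEqual(to_pence(3), 300)

    def test_fractional_pounds(self):
        self.assertEqual(to_pence(19.99), 1999)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestToPence)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The point is the isolation: the module is exercised against known inputs on its own, before integration testing combines it with the rest of the load pipeline.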
INTEGRATION TESTING
In this kind of testing, the various modules of the application are brought
together and then tested against a number of inputs.
SYSTEM TESTING
In this kind of testing, the whole data warehouse application is tested
together.
The purpose of this testing is to check whether the entire system works
correctly together or not.
Since the size of the whole data warehouse is very large, it is usually only
possible to perform minimal system testing before the test plan proper can
be enacted.
Test Schedule
First of all, the test schedule is created as part of the development of the test
plan.
In it we predict the estimated time required for the testing of the entire data
warehouse system.
A seemingly simple problem may involve a large query that takes a day or
more to complete, i.e. the query does not complete in the desired time scale.
There may be hardware failures, such as losing a disk, or human
errors, such as accidentally deleting a table or overwriting a large table.
Note: Due to the above-mentioned difficulties, it is recommended to always double
the amount of time you would normally allow for testing.
Media failure.
Instance failure.
Loss or damage of table.
o Event manager
o System Manager.
o Database Manager.
o Configuration Manager
Testing the database manager and monitoring tools. - To test the database
manager and the monitoring tools, they should be used in the creation, running,
and management of a test database.
Testing database features. - Here is a list of the features that we have to test:
o Querying in parallel
Testing database performance. - Query execution plays a very important role
in data warehouse performance measures. There is a set of fixed queries that
need to be run regularly, and they should be tested. To test ad hoc queries, one
should go through the user requirement document and understand the
business completely. Take the time to test the most awkward queries that the
business is likely to ask against different index and aggregation strategies.
Scheduling Software
Overnight processing
Query Performance
Note: The most important point is to test scalability. Failure to do so will leave us with a
system design that does not work when the system grows.
CHAPTER
21
Future Aspects
This chapter gives a basic idea about Data Warehouse Future aspects.
As we have seen, the size of open databases has approximately doubled in magnitude
in the last few years. This change in magnitude is of great significance.
As the size of databases grows, the estimates of what constitutes a very large
database continue to grow.
The hardware and software available today do not allow a large
amount of data to be kept online. For example, a telco's call records require 10 TB of
data to be kept online, and that is just the size of one month's records. If records of
sales, marketing, customers, employees, etc. are required as well, then the size will be
more than 100 TB.
Records contain not only textual information but also some multimedia
data. Multimedia data cannot be manipulated as easily as text data. Searching
multimedia data is not an easy task, whereas textual information can be retrieved
by the relational software available today.
Apart from size planning, building and running ever-larger data warehouse systems is
very complex. As the number of users increases, the size of the data warehouse also
increases, and these users will also require access to the system.
Hence the future shape of the data warehouse will be very different from what is
being created today.
CHAPTER
22
Interview Questions
This chapter provides a questionnaire on Data Warehousing.
Dear readers, these Data Warehousing Interview Questions have been designed
especially to get you acquainted with the nature of questions you may encounter
during your interview for the subject of Data Warehousing. As per my experience,
good interviewers hardly plan to ask any particular question during your interview;
normally questions start with some basic concept of the subject and later continue
based on further discussion and what you answer:
Q: What is the very basic difference between a data warehouse and operational
databases?
A: A data warehouse contains the historical information that is made available for
analysis of the business, whereas an operational database contains the current
information that is required to run the business.
Q: List the functions of data warehouse tools and utilities.
A: The functions performed by data warehouse tools and utilities are data extraction,
data cleaning, data transformation, data loading, and refreshing.
Q: Define metadata.
A: Metadata is simply defined as data about data. In other words, we can say that
metadata is the summarized data that leads us to the detailed data.
Q: Define dimension.
A: Dimensions are the entities with respect to which an enterprise keeps its
records.
Q: Define the functions of the warehouse manager.
A: The warehouse manager performs consistency and referential integrity checks;
creates indexes, business views, and partition views against the base data; transforms
and merges the source data from the temporary store into the published data
warehouse; backs up the data in the data warehouse; and archives the data that has
reached the end of its captured life.
Q: What is normalization?
A: Normalization splits up the data into additional tables.
Q: What kinds of costs are involved in data marting?
A: Data marting involves hardware and software costs, network access costs, and time
costs.
What is Next?
Further, you can go through the past assignments you have done with the subject
and make sure you are able to speak confidently about them. If you are a fresher, the
interviewer does not expect you to answer very complex questions; rather, you have to
make your basic concepts very strong.
Second, it really doesn't matter much if you could not answer a few questions, but it
matters that whatever you answered, you answered with confidence. So
just feel confident during your interview. We at tutorialspoint wish you the best of luck in
getting a good interviewer, and all the very best for your future endeavors. Cheers :-)
APPENDIX
A
About tutorialspoint.com
Tutorials Point is not a commercial site; this site has been created just for
educational purposes and to help people who are enthusiastic to learn new
technologies.
Tutorials Point is aiming to provide the Best Training Materials on highly demanding
technical and managerial subjects like:
Python
Ruby
HTML5
CSS
JavaScript and related frameworks
Ruby on Rails
JAVA and related technologies
PMP Exams
Earned Value Management
Six Sigma
Parrot
AJAX
PHP Programming
HTML and XHTML
CGI and Perl
C Programming
C++ Programming
XML-RPC
SOAP Communication
HTTP Protocol
Unix Makefile
Web Services
WSDL and UDDI
Wi-Fi and WiMAX
Many more...
TutorialsPoint is a FREE site and will remain FREE in the future as well. If you think it is
worth visiting this website, kindly share it with your friends and colleagues.