
An Overview of Data Warehousing

Abstract
A data warehouse is a computer system designed for archiving and analyzing an organization's
historical data, such as sales, salaries, or other information from day-to-day operations. The topic
of data warehousing encompasses architectures, algorithms, and tools for bringing together
selected data from multiple databases or other information sources into a single repository, called
a data warehouse, suitable for direct querying or analysis. A data warehouse is constructed by
integrating data from multiple heterogeneous sources.
Normally, an organization summarizes and copies information from its operational systems to
the data warehouse on a regular schedule, such as every night or every weekend; after that,
management can perform complex queries and analysis on the information without slowing
down the operational systems. The warehouse supports analytical reporting, structured and ad hoc
queries, and decision making. In recent years data warehousing has become a prominent buzzword in the
database industry, but attention from the database research community has been limited. In this
paper we motivate the concept of a data warehouse.

Keywords
Data warehousing, OLAP, Data warehouse, Data warehousing architecture, Online Analytical
Processing, Database, Methodology, Data warehouse design.

Introduction
Data warehousing is a collection of decision support technologies, aimed at enabling the
knowledge worker (executive, manager, and analyst) to make better and faster decisions.
The term "Data Warehouse" was first coined by Bill Inmon in 1990. According to Inmon, a data
warehouse is a subject oriented, integrated, time-variant, and non-volatile collection of data.
This data helps analysts make informed decisions in an organization.
A data warehouse provides generalized and consolidated data in a multidimensional view.
Along with this generalized and consolidated view of data, a data warehouse also provides
Online Analytical Processing (OLAP) tools. These tools support interactive and effective
analysis of data in a multidimensional space. This analysis results in data generalization and
data mining.
Data mining functions such as association, clustering, classification, and prediction can be
integrated with OLAP operations to enhance the interactive mining of knowledge at multiple
levels of abstraction. This is why the data warehouse has become an important platform for data
analysis and online analytical processing.
OLAP operations include rollup (increasing the level of aggregation) and drill-down
(decreasing the level of aggregation or increasing detail) along one or more dimension
hierarchies, slice-and-dice (selection and projection), and pivot (re-orienting the
multidimensional view of data).
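
The OLAP operations named above can be illustrated with a minimal sketch in plain Python. The fact rows, the dimension names, and the helper functions `aggregate`, `slice_`, and `pivot` are all hypothetical and chosen only for illustration; a real OLAP server would implement these operations quite differently.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, region, product, sales amount).
FACTS = [
    (2023, "Q1", "North", "Widget", 100),
    (2023, "Q1", "South", "Widget", 80),
    (2023, "Q2", "North", "Gadget", 120),
    (2023, "Q2", "South", "Widget", 60),
    (2024, "Q1", "North", "Gadget", 150),
    (2024, "Q1", "South", "Gadget", 90),
]
DIMS = ("year", "quarter", "region", "product")


def aggregate(facts, group_by):
    """Projection plus aggregation: sum sales over the listed dimensions."""
    idx = [DIMS.index(d) for d in group_by]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def slice_(facts, **fixed):
    """Selection: keep only rows whose dimensions match the given values."""
    return [r for r in facts
            if all(r[DIMS.index(d)] == v for d, v in fixed.items())]


def pivot(cells):
    """Re-orient a two-dimensional result by swapping its two axes."""
    return {(b, a): total for (a, b), total in cells.items()}


# Drill-down level: sales per (year, quarter).
by_quarter = aggregate(FACTS, ["year", "quarter"])
# Rollup along the time hierarchy: quarter -> year.
by_year = aggregate(FACTS, ["year"])
# Slice on region, then project onto (year, product).
north = aggregate(slice_(FACTS, region="North"), ["year", "product"])
# Pivot: view region-by-year totals as year-by-region.
year_by_region = pivot(aggregate(FACTS, ["region", "year"]))
```

Here `by_year` is the rolled-up form of `by_quarter`: moving from the first to the second is a rollup, and moving back is a drill-down.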


Data warehouses might be implemented on standard or extended relational DBMSs, called
Relational OLAP servers. These servers assume that data is stored in relational databases, and
they support extensions to SQL and special access and implementation methods to efficiently
implement the multidimensional data model and operations.
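
As a rough illustration of how a multidimensional model might sit on top of a relational DBMS, the sketch below stores a fact table and two dimension tables in SQLite and aggregates them with ordinary SQL. The layout (one fact table joined to dimension tables), the table and column names, and the helper `sales_by` are assumptions made for this example. It uses plain `GROUP BY` rather than the SQL extensions mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        amount REAL
    );
    INSERT INTO dim_product VALUES
        (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware'), (3, 'Manual', 'Books');
    INSERT INTO dim_date VALUES (1, 2023, 1), (2, 2023, 2), (3, 2024, 1);
    INSERT INTO fact_sales VALUES
        (1, 1, 100.0), (2, 1, 50.0), (3, 2, 20.0), (1, 3, 70.0), (3, 3, 30.0);
""")

# Only these dimension columns may be used for grouping (hypothetical whitelist).
ALLOWED = {"d.year", "d.month", "p.name", "p.category"}


def sales_by(conn, *columns):
    """Total sales grouped by the given dimension columns."""
    if not columns or not set(columns) <= ALLOWED:
        raise ValueError("unknown dimension column")
    cols = ", ".join(columns)
    query = f"""
        SELECT {cols}, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_product p ON p.product_id = f.product_id
        JOIN dim_date d ON d.date_id = f.date_id
        GROUP BY {cols}
        ORDER BY {cols}
    """
    return conn.execute(query).fetchall()


per_year = sales_by(conn, "d.year")
per_year_category = sales_by(conn, "d.year", "p.category")
```

Grouping by fewer columns gives a coarser view and grouping by more gives a finer one, which is how rollup and drill-down map onto relational queries in this sketch.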
There is more to building and maintaining a data warehouse than selecting an OLAP server and
defining a schema and some complex queries for the warehouse. Different architectural
alternatives exist. Many organizations want to implement an integrated enterprise warehouse
that collects information about all subjects (e.g., customers, products, sales, assets, personnel)
spanning the whole organization.
The past three years have seen explosive growth, both in the number of products and services
offered and in the adoption of these technologies by industry. According to the META Group,
the data warehousing market, including hardware, database software, and tools, is projected to
grow from $2 billion in 1995 to $8 billion in 1998. Data warehousing technologies have been
successfully deployed in many industries: manufacturing (for order shipment and customer
support), retail (for user profiling and inventory management), financial services (for claims
analysis, risk analysis, credit card analysis, and fraud detection), transportation (for fleet
management), telecommunications (for call analysis and fraud detection), utilities (for power
usage analysis), and healthcare (for outcomes analysis). This paper presents a roadmap of data
warehousing technologies, focusing on the special requirements that data warehouses place on
database management systems.

Data Warehouse Architecture

A typical data warehousing architecture includes tools for extracting data from multiple operational databases and external sources; for
cleaning, transforming and integrating this data; for loading data into the data warehouse; and for
periodically refreshing the warehouse to reflect updates at the sources and to purge data from the
warehouse, perhaps onto slower archival storage. In addition to the main warehouse, there may
be several departmental data marts. Data in the warehouse and data marts is stored and managed
by one or more warehouse servers, which present multidimensional views of data to a variety of
front end tools: query tools, report writers, analysis tools, and data mining tools. Finally, there is
a repository for storing and managing metadata, and tools for monitoring and administering the
warehousing system. The warehouse may be distributed for load balancing, scalability, and
higher availability. In such a distributed architecture, the metadata repository is usually
replicated with each fragment of the warehouse, and the entire warehouse is administered
centrally. An alternative architecture, implemented for expediency when it may be too expensive
to construct a single logically integrated enterprise warehouse, is a federation of warehouses or
data marts, each with its own repository and decentralized administration.

Designing and rolling out a data warehouse is a complex process, consisting of the following activities:

Define the architecture, do capacity planning, and select the storage servers, database and
OLAP servers, and tools.

Integrate the servers, storage, and client tools.

Design the warehouse schema and views.

Define the physical warehouse organization, data placement, partitioning, and access
methods.

Connect the sources using gateways, ODBC drivers, or other wrappers.

Design and implement scripts for data extraction, cleaning, transformation, load, and refresh.

Populate the repository with the schema and view definitions, scripts, and other metadata.

Design and implement end-user applications.

Roll out the warehouse and applications.
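
The extraction, cleaning, transformation, load, and refresh scripts mentioned in the activities above could look roughly like the following sketch. It is only an illustration under stated assumptions: the two in-memory sources, their field names, the `sales` table, and the functions `extract`, `transform`, `load`, and `refresh` are hypothetical, and SQLite stands in for the warehouse server.

```python
import sqlite3

# Two hypothetical operational sources with different field names and formats.
CRM_ROWS = [
    {"cust": " Alice ", "total": "120.50", "day": "2024-01-05"},
    {"cust": "BOB", "total": "n/a", "day": "2024-01-05"},  # unusable amount
]
SHOP_ROWS = [
    {"customer_name": "alice", "amount_cents": 3000, "date": "2024-01-06"},
    {"customer_name": "Carol", "amount_cents": 4550, "date": "2024-01-06"},
]


def extract():
    """Pull raw records from each source, tagged with their origin."""
    yield from (("crm", r) for r in CRM_ROWS)
    yield from (("shop", r) for r in SHOP_ROWS)


def transform(source, raw):
    """Clean one record and map it to the common (customer, day, amount) form.

    Returns None when the record cannot be cleaned.
    """
    try:
        if source == "crm":
            name, day, amount = raw["cust"], raw["day"], float(raw["total"])
        else:
            name, day = raw["customer_name"], raw["date"]
            amount = raw["amount_cents"] / 100
    except (KeyError, ValueError):
        return None
    return name.strip().title(), day, round(amount, 2)


def load(conn, rows):
    """Insert cleaned rows; a row with an existing key replaces the old one."""
    conn.execute("""CREATE TABLE IF NOT EXISTS sales (
        customer TEXT, day TEXT, amount REAL, PRIMARY KEY (customer, day))""")
    conn.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


def refresh(conn):
    """One extract-clean-transform-load cycle; returns the number of rows loaded."""
    cleaned = [t for t in (transform(s, r) for s, r in extract()) if t is not None]
    load(conn, cleaned)
    return len(cleaned)


warehouse = sqlite3.connect(":memory:")
loaded = refresh(warehouse)
refresh(warehouse)  # running the refresh again does not duplicate rows
rows = warehouse.execute(
    "SELECT customer, day, amount FROM sales ORDER BY customer, day"
).fetchall()
```

Keying the load on (customer, day) is a design choice made for this sketch so that a scheduled refresh can be re-run safely. A production refresh would more likely propagate only the changes detected at the sources.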

Literature Review
Researchers from different areas (database management, information system design,
data and information integration) have reached their own conclusions. As Mull (1983)
observes:
"We must be prepared to learn more than we can understand."
Thus, there are many sources that could be quoted to illustrate the research methods used to
understand data warehousing and integration concepts. The work summarized here is based on
a review of the relevant literature and on research performed in Norway and Mozambique.

Data Warehouse Methodology


With the rapid evolution of information and communication technologies and the spread of
computer use, most large and medium-sized organizations now use Information Systems (IS)
to implement their most important processes. Over time, these organizations produce a great deal
of data related to their business, but the data is not integrated. Such data is stored on one or
more platforms and constitutes a resource for the organization, but it is rarely used in the
decision-making process.

Traditional information systems are not designed to manage and store strategic information.
They hold the operational data needed for daily transactions. For decision purposes, such
data are empty and without any transparent value for the decision process of
organizations (Domenico, 2001). Decisions are taken based on administrators' experience and
sometimes based on historical facts stored in different information systems.

A data warehouse is designed so that data can be stored and accessed without being restricted
to relational tables and rows. Because the data warehouse is separate from the operational databases,
user queries have no impact on those systems. The data warehouse is protected from any
unauthorized alteration or loss of data. It provides the foundation and the
resources needed for a Decision Support System (DSS), supplying historical and integrated data.
These data serve top managers, decision makers, partners, and donors, who need brief,
summarized, and integrated information, as well as low-level managers, for whom detailed data
helps to observe tactical aspects of the organization. In this way, the data warehouse provides a
specialized database that manages information from corporate databases and external data
sources.

Basic Concepts
Data Warehouse
Many definitions of the data warehouse can be found in the literature:

Inmon (1997) says that a data warehouse is a subject-oriented, integrated,
time-variant, and non-volatile collection of data that supports the decision-making
process.

Harjinder and Rao (1996) argue that a data warehouse is a running process that
consolidates data from heterogeneous systems, including historical and external
data, to meet the need for structured queries, analytical reports, and decision
support.

Barquini (1996) defines the data warehouse as a collection of techniques and
technologies that together provide a systematic and pragmatic approach to solving the
end user's problem of accessing information that is distributed across different systems
inside the organization.

Kimball et al. (1998) argue that a data warehouse is the source of an organization's data,
formed by the union of all corresponding data marts.

To better understand the data warehouse concept, it is useful to compare
the traditional concept of a database (DB) with that of a data warehouse (DW).
A database is a collection of operational data, stored and used by the application systems of a
specific organization (Batini and Lenzerini, 1986). Data kept by an organization is called
operational or primitive. Batini and Lenzerini (1986) referred to the data stored in a database as
operational data, distinguishing it from input, output, and other types of data. Based on the
Batini and Lenzerini definition of operational data, a data warehouse can be defined as a data
collection derived from operational data to support the decision-making process. These derived
data are usually called analytical, informational, or managerial data (Inmon, 1997).

Conclusion
In the area of integrating multiple, distributed, heterogeneous information sources, data
warehousing is a viable and in some cases superior alternative to traditional research solutions.
Traditional approaches request, process, and merge information from sources when queries are
posed. In the data warehousing approach, information is requested, processed, and merged
continuously, so the information is readily available for direct querying and analysis at the
warehouse. Although the concept of data warehousing is already prominent in the database
industry, we believe there are a number of important open research problems, described above,
that need to be solved to realize the flexible, powerful, and efficient data warehousing systems of
the future.

References

Inmon, W.H. (1992), "Building the Data Warehouse." John Wiley & Sons.

Kimball, R. The Data Warehouse Toolkit. John Wiley, 1996.

Wu, M-C., A.P. Buchmann. Research Issues in Data Warehousing. Submitted for
publication.

Gupta A., Harinarayan V., Quass D. Aggregate-Query Processing in Data Warehouse
Environments, Proc. of VLDB, 1995.

Zhuge, Y., H. Garcia-Molina, J. Hammer, J. Widom, View Maintenance in a
Warehousing Environment, Proc. of SIGMOD Conf., 1995.

Vassiliadis P. and Sellis, T., (1999) A Survey of Logical Models for OLAP Databases.
SIGMOD Record.

S. Rizzi, A. Abelló, J. Lechtenbörger, J. Trujillo (2006) Research in data warehouse
modeling and design: dead or alive? DOLAP, ACM.

Stefano Rizzi, Matteo Golfarelli. (1998) A Methodological Framework for Data
Warehouse Design. DOLAP 98 Washington DC USA.

Luján-Mora and Juan Trujillo (2003) A Comprehensive Method for Data Warehouse
Design.

Juan Trujillo and Sergio Luján-Mora (2004) Physical Modeling of Data Warehouses
using UML. DOLAP 04, Washington, DC, USA.

Luján-Mora and Juan Trujillo (2006) Physical Modeling of Data Warehouses by using
UML Component and Deployment Diagrams, Design and implementation issues.
Journal of Database Management.

Deepti Mishra, Ali Yazici, Beri, Pinar Basaran. (2008) A Casestudy of Data Models in
Data Warehousing.

Hui Ma, Yiping Yang and Fan Zhang (2009) The Anti-standardized Design Research of
Data Warehouse.

Kamal Alaskar and Akhtar Shaikh. (2009) Object Oriented Data Modeling for Data
Warehousing.

N. Tryfona, F. Busborg and J.G. Christiansen (1998), StarER: A Conceptual Model for
Data Warehouse Design.