
Data Warehousing-Unit 1

Overview
 The term "Data Warehouse" was first coined by
Bill Inmon in 1990.
 According to Inmon, a data warehouse is a
subject-oriented, integrated, time-variant, and
non-volatile collection of data.
 This data helps analysts to take informed
decisions in an organization.
Data, Data everywhere yet ...

 We can’t find the data we need: data is scattered over the
network, with many versions and subtle differences.
 We can’t understand the data we find, since the available
data is poorly documented.
 We can’t use the data we find, because the results are
unexpected and the data needs to be transformed from one
form to another.
 For these reasons we need a single, complete, and
consistent store of data, obtained from a variety of different
sources and made available to end users in a way they can
understand and use in a business context.
 So the concept of data warehousing was introduced: it is the
process of transforming data into information and making it
available to users in a timely enough manner to make a
difference.
Understanding a Data Warehouse

 A data warehouse is a database that is maintained
separately from an organization’s operational databases.
 Data warehouse systems allow for the integration of a
variety of application systems.
 They support information processing by providing a solid
platform of consolidated historical data for analysis.
 This consolidated historical data helps the organization to
analyze its business.
 A data warehouse helps executives to organize,
understand, and use their data to take strategic decisions.
Operational vs. Informational Systems

 Operational systems, as their name implies, are the
systems that support the everyday operation of the
enterprise.
 These are the backbone systems of any enterprise, and
include order entry, inventory, manufacturing, payroll and
accounting.
 Due to their importance to the organization, operational
systems were almost always the first parts of the
enterprise to be computerized.
 They are OLTP systems: they run mission-critical
applications and must meet stringent performance
requirements for the routine tasks used to run a business.

 Informational systems deal with analyzing data and
making decisions, often major ones, about how the
enterprise will operate now and in the future.
 Not only do informational systems have a different focus
from operational ones, they often have a different scope.
 Where operational data needs are normally focused
upon a single area, informational data needs often span
a number of different areas and need large amounts of
related operational data.
Why a Data Warehouse is Separated from
Operational Databases
An operational database is constructed for well-known
tasks and workloads such as searching particular records,
indexing, etc. In contrast, data warehouse queries are often
complex and they present a general form of data.
Operational databases support concurrent processing of
multiple transactions. Concurrency control and recovery
mechanisms are required for operational databases to
ensure robustness and consistency of the database.
An operational database query allows read and modify
operations, while an OLAP query needs only read-only
access to stored data.
An operational database maintains current data. On the
other hand, a data warehouse maintains historical data.
Definition and Characteristics

A data warehouse is a
• subject-oriented
• integrated
• time-variant
• non-volatile collection of data that is used primarily in
organizational decision making.
 The four keywords, subject-oriented, integrated, time-
variant, and nonvolatile, distinguish data warehouses
from other data repository systems, such as relational
database systems, transaction processing systems,
and file systems.
Subject-oriented
 A data warehouse is organized around major subjects,
such as customer, supplier, product, and sales.
 Rather than concentrating on the day-to-day
operations and transaction processing of an
organization, a data warehouse focuses on the
modeling and analysis of data for decision makers.
 Hence, data warehouses typically provide a simple and
concise view around particular subject issues by
excluding data that are not useful in the decision
support process.
Integrated:
 A data warehouse is usually constructed by integrating
multiple heterogeneous sources, such as relational
databases, flat files, and on-line transaction records.
 Data cleaning and data integration techniques are
applied to ensure consistency in naming conventions,
encoding structures, attribute measures, and so on.

Time-variant:
 Data are stored to provide information from a
historical perspective (e.g., the past 5–10 years). Every
key structure in the data warehouse contains, either
implicitly or explicitly, an element of time.
Nonvolatile:
 A data warehouse is always a physically separate store
of data transformed from the application data found in
the operational environment.
 Due to this separation, a data warehouse does not
require transaction processing, recovery, and
concurrency control mechanisms.
 It usually requires only two operations in data
accessing: initial loading of data and access of data.
Differences between Operational Database Systems and
DataWarehouses

 The major task of on-line operational database systems is
to perform on-line transaction and query processing.
These systems are called on-line transaction processing
(OLTP) systems.
 They cover most of the day-to-day operations of an
organization, such as purchasing, inventory,
manufacturing, banking, payroll, registration, and
accounting.
 Data warehouse systems, on the other hand, serve users
or knowledge workers in the role of data analysis and
decision making. Such systems can organize and present
data in various formats in order to accommodate the
diverse needs of the different users. These systems are
known as on-line analytical processing (OLAP) systems.
 The major distinguishing features between OLTP and OLAP
are summarized as follows:

Users and system orientation:
 An OLTP system is customer-oriented and is used for
transaction and query processing by clerks, clients, and
information technology professionals.
 An OLAP system is market-oriented and is used for data
analysis by knowledge workers, including managers,
executives, and analysts.
Data contents:
 An OLTP system manages current data that, typically, are too
detailed to be easily used for decision making. An OLAP
system manages large amounts of historical data, provides
facilities for summarization and aggregation, and stores
and manages information at different levels of granularity.
These features make the data easier to use in informed
decision making.
Database design:
 An OLTP system usually adopts an entity-relationship (ER)
data model and an application-oriented database design.
 An OLAP system typically adopts either a star or snowflake
model and a subject-oriented database design.

View:
 An OLTP system focuses mainly on the current data within an
enterprise or department, without referring to historical data
or data in different organizations.
 In contrast, an OLAP system often spans multiple versions of
a database schema, due to the evolutionary process of an
organization. OLAP systems also deal with information that
originates from different organizations, integrating information
from many data stores. Because of their huge volume, OLAP
data are stored on multiple storage media.
Access patterns:
 The access patterns of an OLTP system consist mainly of
short, atomic transactions. Such a system requires
concurrency control and recovery mechanisms.
 However, accesses to OLAP systems are mostly read-only
operations (because most data warehouses store historical
rather than up-to-date information), although many could be
complex queries.
OLAP Operations in the Multidimensional Data Model

 In the multidimensional model, data are organized into
multiple dimensions, and each dimension contains
multiple levels of abstraction defined by concept
hierarchies.
 This organization provides users with the flexibility to
view data from different perspectives.
 A number of OLAP data cube operations exist to
materialize these different views, allowing interactive
querying and analysis of the data at hand.
Roll-up:

 The roll-up operation (also called the drill-up operation by
some vendors) performs aggregation on a data cube, either
by climbing up a concept hierarchy for a dimension or by
dimension reduction.
 Consider, for example, a location hierarchy defined as the
total order “street < city < province or state < country.” A
roll-up operation may aggregate the data by ascending the
location hierarchy from the level of city to the level of country.
 In other words, rather than grouping the data by city, the
resulting cube groups the data by country.
 When roll-up is performed by dimension reduction, one or
more dimensions are removed from the given cube.
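The climb from city to country can be sketched in plain Python. The records, city names, and figures below are hypothetical, not from the AllElectronics example:

```python
from collections import defaultdict

# Hypothetical sales cells at city granularity; the city -> country
# mapping encodes one step of the location concept hierarchy.
sales = [
    {"city": "Vancouver", "country": "Canada", "dollars_sold": 1000},
    {"city": "Toronto",   "country": "Canada", "dollars_sold": 1500},
    {"city": "Chicago",   "country": "USA",    "dollars_sold": 2000},
]

def roll_up(records, level, measure):
    """Aggregate the measure by climbing up to the given hierarchy level."""
    totals = defaultdict(int)
    for r in records:
        totals[r[level]] += r[measure]
    return dict(totals)

by_country = roll_up(sales, "country", "dollars_sold")
# by_country == {'Canada': 2500, 'USA': 2000}
```

Grouping by "city" instead would leave the data at its original level.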
Drill-down

 Drill-down is the reverse of roll-up. It navigates from less
detailed data to more detailed data.
 Drill-down can be realized by either stepping down a concept
hierarchy for a dimension or introducing additional
dimensions. Figure 3.10 shows the result of a drill-down
operation performed on the central cube by stepping down a
concept hierarchy for time defined as “day < month < quarter
< year.”
 Drill-down occurs by descending the time hierarchy from the
level of quarter to the more detailed level of month. The
resulting data cube details the total sales per month rather
than summarizing them by quarter.
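Since the cube stores data at the finest grain, drill-down is simply regrouping at a more detailed level. A minimal sketch with hypothetical month-level records:

```python
from collections import defaultdict

# Hypothetical sales stored at the finest grain (month); quarter is
# derived, so drill-down just regroups by the more detailed level.
sales = [
    {"month": "Jan", "quarter": "Q1", "units_sold": 10},
    {"month": "Feb", "quarter": "Q1", "units_sold": 20},
    {"month": "Apr", "quarter": "Q2", "units_sold": 15},
]

def group_total(records, level):
    """Total units_sold grouped at the requested time level."""
    totals = defaultdict(int)
    for r in records:
        totals[r[level]] += r["units_sold"]
    return dict(totals)

by_quarter = group_total(sales, "quarter")  # summarized view
by_month = group_total(sales, "month")      # drill-down to month
# by_quarter == {'Q1': 30, 'Q2': 15}
```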
Slice and dice

 The slice operation performs a selection on one dimension of
the given cube, resulting in a subcube.
 Figure shows a slice operation where the sales data are
selected from the central cube for the dimension time using
the criterion time = “Q1”.
 The dice operation defines a subcube by performing a
selection on two or more dimensions.
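Representing cube cells as (time, item, location) keys, slice and dice reduce to filtering on one or more key positions. The cells and values below are hypothetical:

```python
# Hypothetical cube cells keyed by (time, item, location).
cube = {
    ("Q1", "phone",  "Vancouver"): 100,
    ("Q1", "laptop", "Toronto"):   200,
    ("Q2", "phone",  "Vancouver"): 150,
}

def slice_cube(cube, dim_index, value):
    """Slice: select on one dimension, yielding a subcube."""
    return {k: v for k, v in cube.items() if k[dim_index] == value}

def dice_cube(cube, criteria):
    """Dice: select on two or more dimensions.

    criteria maps a dimension index to its set of allowed values.
    """
    return {k: v for k, v in cube.items()
            if all(k[i] in allowed for i, allowed in criteria.items())}

q1_slice = slice_cube(cube, 0, "Q1")                # time = "Q1"
sub = dice_cube(cube, {0: {"Q1"}, 1: {"phone"}})    # time = Q1 and item = phone
```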
Pivot (rotate)

 Pivot (also called rotate) is a visualization operation that
rotates the data axes in view in order to provide an alternative
presentation of the data.
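With the same keyed-cell representation, rotating the axes is just reordering each cell's key tuple; the values are untouched. A hypothetical two-dimensional example:

```python
def pivot(cube, axis_order):
    """Rotate the cube's axes by reordering each cell's key tuple."""
    return {tuple(k[i] for i in axis_order): v for k, v in cube.items()}

cube = {("Q1", "phone"): 100, ("Q2", "laptop"): 200}
rotated = pivot(cube, (1, 0))  # item now on the first axis, time on the second
# rotated == {('phone', 'Q1'): 100, ('laptop', 'Q2'): 200}
```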
Steps for the Design and Construction of Data
Warehouses

 To design an effective data warehouse we need to
understand and analyze business needs and construct a
business analysis framework.
 The construction of a large and complex information
system can be viewed as the construction of a large and
complex building, for which the owner, architect, and
builder have different views.
 These views are combined to form a complex
framework that represents the top-down, business-
driven, or owner’s perspective, as well as the bottom-up,
builder-driven, or implementor’s view of the information
system.
 Four different views regarding the design of a data
warehouse must be considered: the top-down view, the data
source view, the data warehouse view, and the business
query view.
 The top-down view allows the selection of the relevant
information necessary for the data warehouse. This
information matches the current and future business needs.
 The data source view exposes the information being
captured, stored, and managed by operational systems. This
information may be documented at various levels of detail
and accuracy, from individual data source tables to
integrated data source tables.
 Data sources are often modeled by traditional data
modeling techniques, such as the entity-relationship model
or CASE (computer-aided software engineering) tools.
 The data warehouse view includes fact tables and dimension
tables. It represents the information that is stored inside the
data warehouse, including pre calculated totals and counts,
as well as information regarding the source, date, and time of
origin, added to provide historical context.
 Finally, the business query view is the perspective of data in
the data warehouse from the viewpoint of the end user.
The warehouse design process consists of the following steps.

 Choose a business process to model, for example, orders,
invoices, shipments, inventory, account administration, sales,
or the general ledger.
 If the business process is organizational and involves multiple
complex object collections, a data warehouse model should
be followed. However, if the process is departmental and
focuses on the analysis of one kind of business process, a
data mart model should be chosen.
 Choose the grain of the business process. The grain is the
fundamental, atomic level of data to be represented in the fact
table for this process, for example, individual transactions,
individual daily snapshots, and so on.
 Choose the dimensions that will apply to each fact table
record. Typical dimensions are time, item, customer, supplier,
warehouse, transaction type, and status.
 Choose the measures that will populate each fact table record.
Typical measures are numeric additive quantities like dollars
sold and units sold.
A Three-Tier Data Warehouse Architecture

 The bottom tier is a warehouse database server that is
almost always a relational database system. Back-end
tools and utilities are used to feed data into the bottom
tier from operational databases or other external sources
(such as customer profile information provided by
external consultants).
 These tools and utilities perform data extraction,
cleaning, and transformation (e.g., to merge similar data
from different sources into a unified format), as well as
load and refresh functions to update the data warehouse.
 The data are extracted using application program
interfaces known as gateways. A gateway is supported
by the underlying DBMS and allows client programs to
generate SQL code to be executed at a server.
 The middle tier is an OLAP server that is typically
implemented using either a relational OLAP (ROLAP)
model, that is, an extended relational DBMS that maps
operations on multidimensional data to standard
relational operations; or a multidimensional OLAP
(MOLAP) model, that is, a special-purpose server that
directly implements multidimensional data and
operations.
 The top tier is a front-end client layer, which contains
query and reporting tools, analysis tools, and/or data
mining tools (e.g., trend analysis, prediction, and so on).
 From the architecture point of view, there are three data
warehouse models: the enterprise warehouse, the data mart,
and the virtual warehouse.
Enterprise warehouse:
 An enterprise warehouse collects all of the information about
subjects spanning the entire organization. It provides
corporate-wide data integration, usually from one or more
operational systems or external information providers, and is
cross-functional in scope.
 It typically contains detailed data as well as summarized data,
and can range in size from a few gigabytes to hundreds of
gigabytes, terabytes, or beyond. An enterprise data warehouse
may be implemented on traditional mainframes, computer
super servers, or parallel architecture platforms.
 It requires extensive business modeling and may take years to
design and build.
Data mart:

 A data mart contains a subset of corporate-wide data that is of
value to a specific group of users. The scope is confined to
specific selected subjects.
 For example, a marketing data mart may confine its subjects
to customer, item, and sales. The data contained in data marts
tend to be summarized.
 Depending on the source of data, data marts can be
categorized as independent or dependent.
 Independent data marts are sourced from data captured from
one or more operational systems or external information
providers, or from data generated locally within a particular
department or geographic area.
 Dependent data marts are sourced directly from enterprise
data warehouses.
Virtual warehouse:

 A virtual warehouse is a set of views over operational
databases. For efficient query processing, only some of the
possible summary views may be materialized.
 A virtual warehouse is easy to build but requires excess
capacity on operational database servers.
Types of OLAP Servers
Relational OLAP (ROLAP) servers:
 These are the intermediate servers that stand in between a
relational back-end server and client front-end tools. They
use a relational or extended-relational DBMS to store and
manage warehouse data, and OLAP middleware to support
missing pieces.
 ROLAP servers include optimization for each DBMS back
end, implementation of aggregation navigation logic, and
additional tools and services.
 ROLAP technology tends to have greater scalability than
MOLAP technology. The DSS server of Microstrategy, for
example, adopts the ROLAP approach.
Multidimensional OLAP (MOLAP) servers:

 These servers support multidimensional views of data through
array-based multidimensional storage engines. They map
multidimensional views directly to data cube array structures.
 The advantage of using a data cube is that it allows fast
indexing to precomputed summarized data. Notice that with
multidimensional data stores, the storage utilization may be
low if the data set is sparse.
 Many MOLAP servers adopt a two-level storage
representation to handle dense and sparse data sets: denser
sub-cubes are identified and stored as array structures,
whereas sparse sub-cubes employ compression technology
for efficient storage utilization.
Hybrid OLAP (HOLAP) servers:

 The hybrid OLAP approach combines ROLAP and MOLAP
technology, benefiting from the greater scalability of ROLAP
and the faster computation of MOLAP.
 For example, a HOLAP server may allow large volumes of
detail data to be stored in a relational database, while
aggregations are kept in a separate MOLAP store.
 Microsoft SQL Server 2000 supports a hybrid OLAP server.
GUIDELINES FOR DATA WAREHOUSE
IMPLEMENTATION

Implementation steps
Requirements analysis and capacity planning:
 The first step in data warehousing involves defining
enterprise needs, defining architecture, carrying out
capacity planning and selecting the hardware and
software tools.
 This step will involve consulting senior management as
well as the various stakeholders.

Hardware integration:
 Once the hardware and software have been selected, they
need to be put together by integrating the servers, the
storage devices and the client software tools
Modelling:
 Modelling is a major step that involves designing the
warehouse schema and views. This may involve using a
modelling tool if the data warehouse is complex.
Physical modelling:
 For the data warehouse to perform efficiently, physical
modelling is required. This involves designing the physical
data warehouse organization, data placement, data
partitioning, deciding on access methods and indexing.
Sources:
 The data for the data warehouse is likely to come from a
number of data sources. This step involves identifying and
connecting the sources using gateways, ODBC drivers or other
wrappers.
ETL:
 The data from the source systems will need to go through an
ETL process. The step of designing and implementing the ETL
process may involve identifying a suitable ETL tool vendor.
 This may include customizing the tool to suit the needs of the
enterprise.
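As a toy sketch of the cleansing and conforming work an ETL process performs (the source rows, field names, and gender encoding map are all hypothetical):

```python
# Two hypothetical sources with inconsistent encodings that must be
# conformed before loading into the warehouse.
source_a = [{"cust": "Alice", "gender": "F", "amount": "100.0"}]
source_b = [{"cust": "Bob", "gender": "male", "amount": "55.5"}]

# Conforming map: unify the gender encodings used by the two sources.
GENDER_MAP = {"F": "female", "M": "male", "female": "female", "male": "male"}

def extract():
    """Extract: pull raw rows from all source systems."""
    return source_a + source_b

def transform(rows):
    """Transform: conform encodings and attribute types across sources."""
    return [{"cust": r["cust"],
             "gender": GENDER_MAP[r["gender"]],
             "amount": float(r["amount"])} for r in rows]

def load(rows, warehouse):
    """Load: append the cleaned rows to the warehouse store."""
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
```

A real ETL tool adds scheduling, staging, and error handling around the same three steps.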
Populate the data warehouse:
 Once the ETL tools have been agreed upon, testing the tools
will be required, perhaps using a staging area.
 Once everything is working satisfactorily, the ETL tools may be
used in populating the warehouse given the schema and view
definitions.
User applications:
 For the data warehouse to be useful there must be end-user
applications. This step involves designing and implementing
applications required by the end users.
Roll-out the warehouse and applications:
 Once the data warehouse has been populated and the end-user
applications tested, the warehouse system and the applications
may be rolled out for the user community to use.
Implementation Guidelines

Build incrementally:

 Data warehouses must be built incrementally. Generally it is
recommended that a data mart first be built with one
particular project in mind; once it is implemented, a number
of other sections of the enterprise may also wish to implement
similar systems.
 An enterprise data warehouse can then be implemented in an
iterative manner allowing all data marts to extract information
from the data warehouse.
 Data warehouse modelling itself is an iterative methodology as
users become familiar with the technology and are then able to
understand and express their requirements more clearly.
Need a champion:

 A data warehouse project must have a champion who is
willing to carry out considerable research into the expected
costs and benefits of the project.
 Data warehousing projects require inputs from many units in
an enterprise and therefore need to be driven by someone
who is capable of interaction with people in the enterprise and
can actively persuade colleagues.
 Without the cooperation of other units, the data model for the
warehouse and the data required to populate the warehouse
may be more complicated than they need to be. Studies have
shown that having a champion can help adoption and success
of data warehousing projects.
Senior management support:

 A data warehouse project must be fully supported by senior
management. Given the resource-intensive nature of such
projects and the time they can take to implement, a
warehouse project calls for a sustained commitment from
senior management.
 This can sometimes be difficult since it may be hard to
quantify the benefits of data warehouse technology and the
managers may consider it a cost without any explicit return
on investment.
 Data warehousing project studies show that top
management support is essential for the success of a data
warehousing project.
Ensure quality:

 The data quality in the source systems is not always high, and
often little effort is made to improve data quality in the
source systems. Improved data quality, when recognized by
the users, helps build confidence in the data warehouse.
Corporate strategy:

 A data warehouse project must fit with corporate strategy and
business objectives. The objectives of the project must be
clearly defined before the start of the project.
 Given the importance of senior management support for a
data warehousing project, the fitness of the project with the
corporate strategy is essential.

Business plan:

 The financial costs (hardware, software, and peopleware),
expected benefits and a project plan (including an ETL plan)
for a data warehouse project must be clearly outlined and
understood by all stakeholders.
 Without such understanding, rumors about expenditure and
benefits can become the only source of information,
undermining the project.
Training:

 A data warehouse project must not overlook data warehouse
training requirements. For a data warehouse project to be
successful, the users must be trained to use the warehouse and
to understand its capabilities.
 Training of users and professional development of the project
team may also be required since data warehousing is a
complex task and the skills of the project team are critical to
the success of the project.

Adaptability:

 The project should build in adaptability so that changes may be
made to the data warehouse if and when required. Like any
system, a data warehouse will need to change, as the needs of
an enterprise change.
 Furthermore, once the data warehouse is operational, new
applications using the data warehouse are almost certain to be
developed.
Joint management:

 The project must be managed by both IT and business
professionals in the enterprise. To ensure good
communication with the stakeholders and that the project is
focused on assisting the enterprise’s business, business
professionals must be involved in the project along with
technical professionals.
Data Warehouse Metadata
 Metadata is simply defined as data about data. The data
that are used to represent other data is known as
metadata.
 For example, the index of a book serves as metadata for
the contents of the book.
 In terms of a data warehouse, we can define metadata as
follows:
 Metadata is a roadmap to data warehouse.
 Metadata in data warehouse defines the warehouse
objects.
 Metadata acts as a directory. This directory helps the
decision support system to locate the contents of a data
warehouse.
Categories of Metadata
Metadata can be broadly categorized into three categories:
 Business Metadata - It has the data ownership
information, business definition, and changing policies.
 Technical Metadata - It includes database system names,
table and column names and sizes, data types and
allowed values. Technical metadata also includes
structural information such as primary and foreign key
attributes and indices.
 Operational Metadata - It includes currency of data and
data lineage. Currency of data means whether the data is
active, archived, or purged. Lineage of data means the
history of data migrated and transformation applied on it.
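One way to picture the three categories is as a nested record for a single warehouse table. The table, owner, and lineage entries below are hypothetical:

```python
# Hypothetical metadata record for one warehouse table, grouped into the
# three categories described above.
metadata = {
    "business": {
        "owner": "Sales department",
        "definition": "Daily sales facts per store and item",
    },
    "technical": {
        "table": "sales_fact",
        "columns": {"dollars_sold": "DECIMAL(10,2)", "time_key": "INTEGER"},
        "primary_key": ["time_key", "item_key"],
    },
    "operational": {
        "currency": "active",  # active, archived, or purged
        "lineage": ["extracted from orders_db", "currency converted to USD"],
    },
}

def columns_of(meta, table):
    """Directory-style lookup: locate column names for a table."""
    if meta["technical"]["table"] == table:
        return list(meta["technical"]["columns"])
    return []
```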
 The Kimball technical system architecture separates the
data and processes comprising the DW/BI system into
the backroom extract, transformation and load (ETL)
environment and the front room presentation area, as
illustrated in the following diagram
Backroom ETL system

 The Kimball Group has identified 34 subsystems in the ETL
process flow, grouped into four major operations:
 extracting the data from the sources,
 performing cleansing and conforming transformations,
 delivering it to the presentation server, and
 managing the ETL process and back room environment.
Front room presentation area

 The Kimball Architecture presumes the data utilized by the BI
applications is dimensionally structured, organized by business
process, atomically grained (complemented by aggregated
summaries for performance tuning), and tied together by the
enterprise data warehouse bus architecture.
Front room BI applications

 The front room is the public face of the DW/BI system; it’s
what business users see and work with day-to-day.
 There’s a broad range of BI applications supported by BI
management services in the front room, including ad hoc
queries, standardized reports, dashboards and scorecards,
and more powerful analytic or mining/modeling applications.
Metadata

 Metadata is all the information that defines and describes the
structures, operations, and contents of the DW/BI system.
 Technical metadata defines the objects and processes which
comprise the DW/BI system. 
 Business metadata describes the data warehouse contents in
user terms, including what data is available, where it came
from, what it means, and how it relates to other data.
 Finally, process metadata describes the warehouse’s
operational results.
Characteristics of OLAP
1) Multidimensional Conceptual View

 User-analysts would view an enterprise as being
multidimensional in nature – for example, profits could be
viewed by region, product, time period, or scenario (such as
actual, budget, or forecast).
 Multi-dimensional data models enable more straightforward
and intuitive manipulation of data by users, including slicing
and dicing.

2) Transparency

 When OLAP forms part of the users’ customary spreadsheet
or graphics package, this should be transparent to the user.
 OLAP should be part of an open systems architecture
which can be embedded in any place desired by the user
without adversely affecting the functionality of the host
tool.
 The user should not be exposed to the source of the
data supplied to the OLAP tool, which may be
homogeneous or heterogeneous.
3) Accessibility

 The OLAP tool should be capable of applying its own
logical structure to access heterogeneous sources of data
and perform any conversions necessary to present a
coherent view to the user.
 The tool (and not the user) should be concerned with
where the physical data comes from.
4) Consistent reporting performance

 Performance of the OLAP tool should not suffer
significantly as the number of dimensions is increased.

5) Client/server architecture

 The server component of OLAP tools should be
sufficiently intelligent that the various clients can be
attached with minimum effort. The server should be
capable of mapping and consolidating data between
disparate databases.
6) Generic Dimensionality

 Every data dimension should be equivalent in its
structure and operational capabilities.

7) Dynamic sparse matrix handling

 The OLAP server’s physical structure should have
optimal sparse matrix handling.

8) Multi-user support

 OLAP tools must provide concurrent retrieval and
update access, integrity and security.
9) Unrestricted cross-dimensional operations

 Computational facilities must allow calculation and
data manipulation across any number of data
dimensions, and must not restrict any relationship
between data cells.

10) Intuitive data manipulation

 Data manipulation inherent in the consolidation path,
such as drilling down or zooming out, should be
accomplished via direct action on the analytical
model’s cells, and not require use of a menu or
multiple trips across the user interface.
11) Flexible reporting

 Reporting facilities should present information in any
way the user wants to view it.

12) Unlimited Dimensions and aggregation levels.

 The number of data dimensions supported should, to
all intents and purposes, be unlimited.
 Each generic dimension should enable an essentially
unlimited number of user-defined aggregation levels
within any given consolidation path.
Multidimensional Data Model
 The most popular data model for a data warehouse is a
multidimensional model. Such a model can exist in the
form of a star schema, a snowflake schema, or a fact
constellation schema.
Star schema
 The most common modeling paradigm is the star schema,
in which the data warehouse contains (1) a large central
table (fact table) containing the bulk of the data, with no
redundancy, and (2) a set of smaller attendant tables
(dimension tables), one for each dimension.
 The schema graph resembles a starburst, with the
dimension tables displayed in a radial pattern around the
central fact table.
 A star schema for AllElectronics sales is shown in Figure.
Sales are considered along four dimensions, namely, time,
item, branch, and location.
 The schema contains a central fact table for sales that
contains keys to each of the four dimensions, along with
two measures: dollars sold and units sold.
 To minimize the size of the fact table, dimension identifiers
(such as time key and item key) are system-generated
identifiers.
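A minimal star schema for two of these dimensions can be sketched with SQLite. The column names and row values are illustrative; branch and location are omitted for brevity:

```python
import sqlite3

# Hypothetical star schema: one fact table with foreign keys into two
# dimension tables (time and item).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    dollars_sold REAL,
    units_sold INTEGER
);
""")
con.execute("INSERT INTO time_dim VALUES (1, 2023, 'Q1')")
con.execute("INSERT INTO item_dim VALUES (1, 'phone', 'Acme')")
con.execute("INSERT INTO sales_fact VALUES (1, 1, 500.0, 5)")

# A typical star-join query: aggregate the fact table by a dimension attribute.
row = con.execute("""
    SELECT t.quarter, SUM(f.dollars_sold)
    FROM sales_fact f JOIN time_dim t ON f.time_key = t.time_key
    GROUP BY t.quarter
""").fetchone()
```

Each query joins the central fact table to the dimension tables it needs, which is the radial "starburst" access pattern described above.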
Snowflake schema

 The major difference between the snowflake and star
schema models is that the dimension tables of the
snowflake model may be kept in normalized form to
reduce redundancies.
 Such a table is easy to maintain and saves storage
space. However, this saving of space is negligible in
comparison to the typical magnitude of the fact table.
 Furthermore, the snowflake structure can reduce the
effectiveness of browsing, since more joins will be
needed to execute a query. Consequently, the system
performance may be adversely impacted. Hence,
although the snowflake schema reduces redundancy, it
is not as popular as the star schema in data warehouse
design.

Fact constellation:

 Sophisticated applications may require multiple fact
tables to share dimension tables.
 This kind of schema can be viewed as a collection of
stars, and hence is called a galaxy schema or a fact
constellation.
Data Cube Implementation

1) Pre-compute and store all

 This means that millions of aggregates will need to be
computed and stored.
 Although this is the best solution as far as query response
time is concerned, the solution is impractical since
resources required to compute the aggregates and to store
them will be prohibitively large for a large data cube.
Indexing large amounts of data is also expensive.
2) Pre-compute (and store) none

 This means that the aggregates are computed on the fly
using the raw data whenever a query is posed.
 This approach does not require additional space for
storing the cube but the query response time is likely to be
very poor for large data cubes.

3) Pre-compute and store some

 This means that we pre-compute and store the most
frequently queried aggregates and compute others as
the need arises.
 Some queries can be answered from the pre-computed
aggregates, while for others it will be necessary to access
the database (e.g. the data warehouse) to compute the
remaining aggregates.
Data Cube Implementation
Efficient Computation of Data Cubes
 At the core of multidimensional data analysis is the
efficient computation of aggregations across many sets of
dimensions. In SQL terms, these aggregations are referred
to as group-by’s.
 Each group-by can be represented by a cuboid, where the
set of group-by’s forms a lattice of cuboids defining a data
cube.
 A data cube is a lattice of cuboids. Suppose that you would
like to create a data cube for AllElectronics sales that
contains the following: city, item, year, and sales in dollars.
You would like to be able to analyze the data, with queries
such as the following:
 “Compute the sum of sales, grouping by city and item.”
 “Compute the sum of sales, grouping by city.”
 “Compute the sum of sales, grouping by item.”
 The possible group-by’s are the following: (city, item, year),
(city, item), (city, year), (item, year), (city), (item), (year), (),
where () means that the group-by is empty (i.e., the
dimensions are not grouped).
 These group-by’s form a lattice of cuboids for the data
cube, as shown in Figure .
 The base cuboid contains all three dimensions, city, item,
and year.
 It can return the total sales for any combination of the
three dimensions. The apex cuboid, or 0-D cuboid, refers to
the case where the group-by is empty.
 It contains the total sum of all sales.
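The full lattice of group-by's can be enumerated directly. The sketch below, with hypothetical fact rows, computes SUM(sales) for every cuboid of (city, item, year):

```python
from itertools import combinations

# Hypothetical fact rows: (city, item, year, sales_in_dollars).
facts = [
    ("Vancouver", "phone",  2023, 100),
    ("Vancouver", "laptop", 2023, 200),
    ("Toronto",   "phone",  2024, 50),
]
dims = ("city", "item", "year")

def cuboids(facts, dims):
    """Compute SUM(sales) for every group-by (cuboid) in the lattice."""
    result = {}
    for r in range(len(dims) + 1):
        for group in combinations(range(len(dims)), r):
            agg = {}
            for row in facts:
                key = tuple(row[i] for i in group)
                agg[key] = agg.get(key, 0) + row[-1]
            result[tuple(dims[i] for i in group)] = agg
    return result

lattice = cuboids(facts, dims)
# 2**3 == 8 cuboids; the apex cuboid () holds the grand total:
# lattice[()] == {(): 350}
```

The base cuboid is lattice[("city", "item", "year")], the apex cuboid is lattice[()].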

“How many cuboids are there in an n-dimensional data cube?”


➢ The dimension time is usually not explored at only one
conceptual level, such as year, but rather at multiple
conceptual levels, such as in the hierarchy “day < month <
quarter < year”.
➢ For an n-dimensional data cube, the total number of cuboids
that can be generated (including the cuboids generated by
climbing up the hierarchies along each dimension) is:

➢ where Li is the number of levels associated with dimension i.
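The product can be checked in one line; the level counts below are hypothetical:

```python
from math import prod

def total_cuboids(levels):
    """Total cuboids of an n-D cube, where levels[i] is Li, the number of
    hierarchy levels of dimension i (the +1 accounts for the 'all' level)."""
    return prod(l + 1 for l in levels)

# e.g. time with 4 levels (day < month < quarter < year) and two flat
# dimensions with 1 level each: (4+1) * (1+1) * (1+1) == 20.
count = total_cuboids([4, 1, 1])
```

With one level per dimension this reduces to 2**n, matching the 8 cuboids of the (city, item, year) lattice above.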
