
Chapter 3
Data Warehouse and OLAP Technology: An Overview

3.7 Exercises
1. State why, for the integration of multiple heterogeneous information sources, many companies in industry
prefer the update-driven approach (which constructs and uses data warehouses), rather than the query-driven
approach (which applies wrappers and integrators). Describe situations where the query-driven approach is
preferable over the update-driven approach.
Answer:
For decision-making queries and frequently asked queries, the update-driven approach is preferable.
This is because expensive data integration and aggregate computation are done before query
processing time. In order for the data collected in multiple heterogeneous databases to be used in
decision-making processes, the data must be integrated and summarized, with the semantic heterogeneity
problems among the databases analyzed and resolved. If the query-driven approach is employed,
such queries are translated into multiple (often complex) queries for each individual database. The
translated queries compete for resources with the activities at the local sites, thus degrading their
performance. In addition, these queries generate a complex answer set, which requires further
filtering and integration. Thus, the query-driven approach is, in general, inefficient and expensive. The
update-driven approach employed in data warehousing is faster and more efficient, since most of the
required integration and aggregation work can be done off-line, before query time.
For queries that are used rarely, reference the most current data, and/or do not require aggregations,
the query-driven approach is preferable over the update-driven approach. In such cases it may not
be justifiable for an organization to pay the heavy expense of building and maintaining a data warehouse:
for example, if only a small number of relatively small databases are involved, or if the queries rely on
current data, since a data warehouse does not contain the most current information.
2. Briefly compare the following concepts. You may use an example to explain your point(s).
(a) Snowflake schema, fact constellation, starnet query model
(b) Data cleaning, data transformation, refresh
(c) Discovery-driven cube, multifeature cube, virtual warehouse
Answer:
(a) Snowflake schema, fact constellation, starnet query model
The snowflake schema and fact constellation are both variants of the star schema model, which consists of
a fact table and a set of dimension tables; the snowflake schema contains some normalized dimension
tables, whereas the fact constellation contains a set of fact tables that share some common dimension
tables. A starnet query model is a query model (not a schema model) consisting of a set of radial lines
emanating from a central point. Each radial line represents one dimension, each point (called a
footprint) along a line represents a level of that dimension, and each step outward from the center
represents stepping down a level in the dimension's concept hierarchy. The starnet query model, as
suggested by its name, is used for querying and provides users with a global view of OLAP operations.
(b) Data cleaning, data transformation, refresh
Data cleaning is the process of detecting errors in the data and rectifying them when possible. Data
transformation is the process of converting data from heterogeneous sources into a unified data
warehouse format or semantics. Refresh is the process of propagating updates from the data sources
to the warehouse.
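As an illustration of the first two functions, here is a minimal SQL sketch of a load step that cleans and
transforms data from a hypothetical source table src_sales_eu into a hypothetical warehouse table
warehouse_sales; the table names, columns, and conversion rate are assumptions, and the exact functions
available vary by DBMS.

    -- Cleaning and transformation while loading source data into the warehouse.
    insert into warehouse_sales (sale_date, amount_usd, region)
    select to_date(order_dt, 'DD/MM/YYYY'),   -- transformation: unify the date format
           amount_eur * 1.08,                 -- transformation: convert currency (assumed fixed rate)
           upper(trim(region_cd))             -- cleaning: normalize inconsistent region codes
    from src_sales_eu
    where amount_eur is not null;             -- cleaning: discard records with a missing measure

A refresh would simply rerun such a load (fully or incrementally) on a schedule to propagate new source
records to the warehouse.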
(c) Discovery-driven cube, multifeature cube, virtual warehouse
A discovery-driven cube uses precomputed measures and visual cues to indicate data exceptions at all
levels of aggregation, guiding the user in the data analysis process. A multifeature cube computes
complex queries involving multiple dependent aggregates at multiple granularities (e.g., to find the total
sales for every item having a maximum price, we need to apply the aggregate function SUM to the tuple
set output by the aggregate function MAX). A virtual warehouse is a set of views (containing data
warehouse schema, dimension, and aggregate measure definitions) over operational databases.
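To make the multifeature cube example concrete, here is a hedged SQL sketch of "the total sales for every
item at its maximum price", assuming a hypothetical relation sales(item, price, amount):

    -- Dependent aggregates: SUM is applied only to the tuples selected by MAX.
    select s.item, SUM(s.amount) as total_sales_at_max_price
    from sales s
    where s.price = (select MAX(p.price)
                     from sales p
                     where p.item = s.item)
    group by s.item;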
3. Suppose that a data warehouse consists of the three dimensions time, doctor, and patient, and the two
measures count and charge, where charge is the fee that a doctor charges a patient for a visit.
(a) Enumerate three classes of schemas that are popularly used for modeling data warehouses.
(b) Draw a schema diagram for the above data warehouse using one of the schema classes listed in (a).
(c) Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed
in order to list the total fee collected by each doctor in 2004?
(d) To obtain the same list, write an SQL query assuming the data is stored in a relational database with
the schema fee(day, month, year, doctor, hospital, patient, count, charge).
Answer:
(a) Enumerate three classes of schemas that are popularly used for modeling data warehouses.
Three classes of schemas popularly used for modeling data warehouses are the star schema, the snowflake
schema, and the fact constellation schema.
(b) Draw a schema diagram for the above data warehouse using one of the schema classes listed in (a).
A star schema is shown in Figure 3.1.
(c) Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed
in order to list the total fee collected by each doctor in 2004?
The operations to be performed are:

- Roll-up on time from day to year.
- Slice for time = 2004.
- Roll-up on patient from individual patient to all.
(d) To obtain the same list, write an SQL query assuming the data is stored in a relational database with
the schema fee(day, month, year, doctor, hospital, patient, count, charge).

    select doctor, SUM(charge)
    from fee
    where year = 2004
    group by doctor

4. Suppose that a data warehouse for Big-University consists of the following four dimensions: student, course,
semester, and instructor, and two measures count and avg_grade. When at the lowest conceptual level (e.g.,
for a given student, course, semester, and instructor combination), the avg_grade measure stores the
actual course grade of the student. At higher conceptual levels, avg_grade stores the average grade for
the given combination.
(a) Draw a snowflake schema diagram for the data warehouse.
(b) Starting with the base cuboid [student, course, semester, instructor], what specific OLAP operations (e.g., roll-up
from semester to year) should one perform in order to list the average grade of CS courses for each
Big-University student?
(c) If each dimension has five levels (including all), such as student < major < status < university < all,
how many cuboids will this cube contain (including the base and apex cuboids)?
Answer:
(a) Draw a snowflake schema diagram for the data warehouse.
A snowflake schema is shown in Figure 3.2.
(b) Starting with the base cuboid [student, course, semester, instructor], what specific OLAP operations (e.g., roll-up
from semester to year) should one perform in order to list the average grade of CS courses for each
Big-University student?
The specific OLAP operations to be performed are as follows (an SQL sketch of an equivalent query is given
after the list):

- Roll-up on course from course_id to department.
- Roll-up on student from student_id to university.
- Dice on course, student with department = "CS" and university = "Big-University".
- Drill-down on student from university to student_name.
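For comparison, a hedged SQL sketch of an equivalent relational query is shown below, assuming the
snowflake dimension tables have already been joined into a single view grades(student_name, university,
department, grade), where grade is the base-level course grade; the view and its columns are illustrative
only.

    -- Average grade over all CS courses, per Big-University student.
    select student_name, AVG(grade) as avg_grade
    from grades
    where department = 'CS'
      and university = 'Big-University'
    group by student_name;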


(c) If each dimension has five levels (including all), such as student < major < status < university < all,
how many cuboids will this cube contain (including the base and apex cuboids)?
This cube will contain 5^4 = 625 cuboids.
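This follows the usual counting argument: a cuboid is determined by independently fixing each dimension at
one of its levels (counting all as a level), so for n dimensions with L_i levels (excluding all) in dimension i,

    \text{total cuboids} = \prod_{i=1}^{n} (L_i + 1) = (4 + 1)^4 = 5^4 = 625.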
5. Suppose that a data warehouse consists of the four dimensions date, spectator, location, and game, and the
two measures count and charge, where charge is the fare that a spectator pays when watching a game on a
given date. Spectators may be students, adults, or seniors, with each category having its own charge
rate.
(a) Draw a star schema diagram for the data warehouse.
(b) Starting with the base cuboid [date, spectator, location, game], what specific OLAP operations should
one perform in order to list the total charge paid by student spectators at GM Place in 2004?
(c) Bitmap indexing is useful in data warehousing. Taking this cube as an example, briefly discuss the
advantages and problems of using a bitmap index structure.
Answer:
(a) Draw a star schema diagram for the data warehouse.
A star schema is shown in Figure 3.3.
(b) Starting with the base cuboid [date, spectator, location, game], what specific OLAP operations should
one perform in order to list the total charge paid by student spectators at GM Place in 2004?
The specific OLAP operations to be performed are as follows (an SQL sketch of an equivalent query is given
after the list):

- Roll-up on date from date_id to year.
- Roll-up on game from game_id to all.
- Roll-up on location from location_id to location_name.
- Roll-up on spectator from spectator_id to status.
- Dice with status = "students", location_name = "GM Place", and year = 2004.
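As in Exercise 3, a hedged SQL sketch of an equivalent query is shown below, assuming the dimension tables
have been joined into a single view fact_view(year, status, location_name, charge); the view and column
names are illustrative only.

    -- Total charge paid by student spectators at GM Place in 2004.
    select SUM(charge) as total_charge
    from fact_view
    where status = 'students'
      and location_name = 'GM Place'
      and year = 2004;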

(c) Bitmap indexing is useful in data warehousing. Taking this cube as an example, briefly discuss the
advantages and problems of using a bitmap index structure.
Bitmap indexing is advantageous for low-cardinality domains. For example, if the dimension location in
this cube is bitmap indexed, then comparison, join, and aggregation operations over location are reduced
to bit arithmetic, which substantially reduces the processing time. Furthermore, strings of long location
names can be represented by a single bit, which leads to a significant reduction in space and I/O. For
high-cardinality dimensions, such as date in this example, the bit vectors used to represent the bitmap
index can become very long. For example, a 10-year collection of data could result in 3650 date records,
meaning that every tuple in the fact table would require 3650 bits, or approximately 456 bytes, to hold
the bitmap index.
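As a concrete (hedged) illustration, the statement below uses Oracle-style syntax to create a bitmap index
on the low-cardinality location key of a hypothetical fact table charge_fact; bitmap index support and
syntax vary by DBMS.

    -- One bit vector per distinct location_id, with one bit per fact-table row;
    -- selections and joins on location then reduce to bitwise AND/OR operations.
    create bitmap index charge_fact_location_bix
        on charge_fact (location_id);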
6. [Tao Cheng] A data warehouse can be modeled by either a star schema or a snowflake schema. Briefly
describe the similarities and the differences of the two models, and then analyze their advantages and
disadvantages with regard to one another. Give your opinion of which might be more empirically useful
and state the reasons behind your answer.
Answer:
The two models are similar in the sense that each has a fact table together with a set of dimension tables.
The major difference is that some dimension tables in the snowflake schema are normalized, thereby further
splitting the data into additional tables. The advantage of the star schema is its simplicity, which enables
efficient query processing, although it requires more space. The snowflake schema reduces some redundancy
by normalizing and sharing dimension tables: the tables are easier to maintain and some space is saved.
However, it is less efficient for querying, and the space saving is negligible in comparison with the typical
magnitude of the fact table. Therefore, empirically, the star schema is usually the better choice, simply
because query efficiency nowadays has higher priority than space, provided the space overhead is not too
large. Sometimes in industry, to speed up processing, people denormalize data from a snowflake schema
into a star schema [1]. Another option is that some practitioners use a snowflake schema to maintain the
dimensions, and then present users with the same data collapsed into a star [2].
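To make the structural difference concrete, here is a hedged DDL sketch for a single location dimension;
all table and column names are illustrative. The star variant keeps the geographic attributes denormalized
in one dimension table, while the snowflake variant normalizes them into a separate, shared table.

    -- Star schema: one denormalized dimension table.
    create table location_dim (
        location_id  int primary key,
        street       varchar(100),
        city         varchar(50),
        province     varchar(50),
        country      varchar(50)
    );

    -- Snowflake schema: the same dimension normalized into two tables.
    create table city_dim (
        city_id      int primary key,
        city         varchar(50),
        province     varchar(50),
        country      varchar(50)
    );
    create table location_dim_sf (
        location_id  int primary key,
        street       varchar(100),
        city_id      int references city_dim(city_id)
    );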
References for this answer:
[1] Oracle Tip: Understand the difference between star and snowflake schemas in OLAP.
http://builder.com.com/5100-6388-5221690.html
[2] Star vs. Snowflake Schemas.
