
Q.1. What is a data warehouse, and what are the key features of a data warehouse?

Ans. A data warehouse is a database that is maintained separately from an organization's operational databases. Data warehouse systems allow for the integration of a variety of application systems. They support information processing by providing a solid platform of consolidated historical data for analysis. The four key features of a data warehouse are as follows:
i. Subject-oriented: Rather than focusing on day-to-day operational work, a data warehouse concentrates on the modeling and analysis of data for decision makers.
ii. Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources. Data cleaning and data integration techniques are applied to ensure consistency in data structures.
iii. Time-variant: Data are stored to provide information from a historical perspective (e.g., the past 5 to 10 years).
iv. Nonvolatile: Because a data warehouse is always a physically separate store of data transformed from the application data found in the operational environment, it does not require transaction processing, recovery, or concurrency control mechanisms. It usually requires only two operations for data access: initial loading of data and access of data.

Q.2. What is the difference between data warehouses and operational databases?

Ans. Operational databases handle most of the day-to-day transactions of an organization, such as purchasing, inventory, manufacturing, banking, payroll, registration, and accounting. For example, IIMC databases are operational. Data warehouses, on the other hand, are populated through an ETL process and use a data model designed for decision support.

Q.3. What are OLAP and OLTP? What are the major differences between them?

Ans. The major task of on-line operational database systems is to perform on-line transaction and query processing. These systems are called on-line transaction processing (OLTP) systems. They cover most of the day-to-day operations of an organization, such as purchasing, inventory, manufacturing, banking, payroll, registration, and accounting. Data warehouse systems, on the other hand, serve users in the role of data analysis and decision making. Such systems can organize and present data in various formats in order to accommodate the diverse needs of different users. These systems are known as on-line analytical processing (OLAP) systems. The major distinguishing features of OLTP and OLAP are as follows:
Users and system orientation: An OLTP system is customer-oriented and is used for transaction and query processing. An OLAP system is market-oriented and is used for data analysis and modeling.
Data contents: An OLTP system manages current data that are typically too detailed to be easily used for decision making. An OLAP system manages large amounts of historical data, provides facilities for summarization and aggregation, and stores and manages information at different levels of granularity. These features make the data easier to use for informed decision making.
Database design: An OLTP system usually adopts an entity-relationship data model and an application-oriented database design. An OLAP system typically adopts either a star or a snowflake model and a subject-oriented database design.
View: An OLTP system focuses mainly on the current data within an enterprise or department. In contrast, an OLAP system often spans multiple versions of a database and, because of the huge volume of data, OLAP data are stored on multiple storage media. Furthermore, OLAP systems deal with the integration of data from heterogeneous sources.
Access pattern: An OLTP system uses short, atomic transactions and thus requires concurrency control and recovery mechanisms. OLAP systems, on the other hand, perform mostly read-only operations, so they do not require such concurrency control and recovery mechanisms.

For other distinguishing features of OLTP and OLAP, see the book Data Mining: Concepts and Techniques by Han and Kamber.

Q.4. How many data warehouse models are there? Explain each of them.

Ans. There are three data warehouse models.
1. Enterprise warehouse: An enterprise warehouse collects all of the information about subjects spanning the entire organization. It provides corporate-wide data integration, usually from one or more operational systems or external information providers, and is cross-functional in scope. It typically contains detailed data as well as summarized data, and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
2. Data mart: A data mart contains a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific selected subjects. For example, a marketing data mart may confine its subjects to Customer, Item and Sales. The data contained in data marts tend to be summarized. Depending on the source of data, data marts can be categorized as independent or dependent. Independent data marts are sourced from data captured from one or more operational systems or external information providers, or from data generated locally within a particular department or geographic area. Dependent data marts are sourced directly from enterprise data warehouses.
3. Virtual warehouse: A virtual warehouse is a set of views over operational databases. For efficient query processing, only some of the possible summary views may be materialized. It is easy to build but requires excess capacity on the operational database servers.

Q.5. What is the ETL process?

Ans. ETL stands for Extraction, Transformation, and Loading. Data warehouse systems use back-end tools and utilities to populate and refresh their data. These tools and utilities include the following functions:
Data extraction: gathers data from multiple, heterogeneous, and external sources.
Data cleaning: detects errors in the data and rectifies them when possible.
Data transformation: converts data from legacy or host format to warehouse format.
Load: sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and partitions.
Refresh: propagates the updates from the data sources to the warehouse.
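The ETL functions listed above can be sketched as a small pipeline. This is only an illustrative sketch: the source records, the `sales` table, and the function names are hypothetical and are not taken from any particular ETL tool. An in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical raw records from two source systems with inconsistent formats.
source_a = [{"item": "TV", "amount": "1,000"}, {"item": "dvd", "amount": "700"}]
source_b = [{"item": "Oven ", "amount": "5000"}, {"item": "Bread", "amount": None}]


def extract():
    """Extraction: gather records from multiple heterogeneous sources."""
    return source_a + source_b


def clean(rows):
    """Cleaning: drop records whose amount is missing."""
    return [r for r in rows if r["amount"] is not None]


def transform(rows):
    """Transformation: convert to the warehouse format (upper-case name, integer amount)."""
    return [(r["item"].strip().upper(), int(r["amount"].replace(",", ""))) for r in rows]


def load(conn, rows):
    """Load: store the rows and build an index on the item column."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (item TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_item ON sales (item)")
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
load(conn, transform(clean(extract())))
loaded = conn.execute("SELECT item, amount FROM sales ORDER BY item").fetchall()
print(loaded)
```

A refresh step would rerun the same pipeline on only the records that changed since the last load.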

Q.6. What is metadata?

Ans. Metadata is data about data. When used in a data warehouse, metadata are the data that define warehouse objects.

Q.7. Explain the following terms: i.) Data cube. ii.) Lattice. iii.) Apex cuboid. iv.) Base cuboid. v.) Non-base cuboid.

Ans. i.) Data cube: A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts. Dimensions are the perspectives or entities with respect to which an organization wants to keep records. Facts are typically numeric measures.

The figure above shows a 3-D cube whose three dimensions are Location, Time and Item, and whose measure is Dollar_sold (in thousands). A 4-D cube with the additional dimension Supplier can be represented as a series of 3-D cubes, one per supplier.

In this way, any n-dimensional cube can be displayed as a series of (n-1)-dimensional cubes. Each of the cubes in the two figures above is often referred to as a cuboid.
ii.) Lattice: Given a set of dimensions, we can generate a cuboid for each of the possible subsets of the given dimensions. For example, for the dimensions Location, Time and Item, there are 2^3 (= 8) possible cuboids: (), (Location), (Time), (Item), (Location, Time), (Location, Item), (Time, Item) and (Location, Time, Item). These cuboids form a lattice, each showing the data at a different level of summarization, or group-by. The lattice of cuboids is then referred to as a data cube.
iii.) Apex cuboid: The apex cuboid (0-D cuboid) holds the highest level of summarization. In the figure above, "all" represents the apex cuboid, which holds total_sales, that is, the total Dollar_sold.
iv.) Base cuboid: The base cuboid (the 4-D cuboid) contains all four dimensions Time, Item, Location and Supplier. It can return the total sales for any combination of the four dimensions. It has the lowest level of summarization.
v.) Non-base cuboid: All the other 1-D, 2-D and 3-D cuboids in the figure above are non-base cuboids.
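The lattice of cuboids can be enumerated directly, since each cuboid corresponds to one subset of the dimensions. The sketch below assumes the three dimensions Location, Time and Item from the example; the function name `cuboids` is illustrative.

```python
from itertools import combinations

dimensions = ("Location", "Time", "Item")


def cuboids(dims):
    """Return every subset of dims, ordered from the apex cuboid () to the base cuboid."""
    return [combo for k in range(len(dims) + 1) for combo in combinations(dims, k)]


lattice = cuboids(dimensions)
apex, base = lattice[0], lattice[-1]

for cuboid in lattice:
    print(f"{len(cuboid)}-D cuboid: {cuboid}")
```

The empty tuple is the apex cuboid (group by nothing), and the tuple containing all dimensions is the base cuboid.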

Q.8. What is a database schema? What types of schemas are used for data warehouses? Explain each of them.

Ans. Databases change over time as information is inserted and deleted. The collection of information stored in the database at a particular time is called an instance of the database. The overall design of the database is called the database schema. A multidimensional model can exist in one of three schema forms.
i.) Star Schema: A fact table sits in the middle and the dimension tables are arranged around it. The overall picture looks like a star.
[Figure: Star schema. A central fact table (listing all dimensions and measures) surrounded by Dimension Tables 1 to 5.]

ii.) Snowflake Schema: This is an extension of the star schema in which some of the dimension tables are further split into additional sub-dimension tables. The resulting figure resembles a snowflake.

[Figure: Snowflake schema. A central fact table (listing all dimensions and measures) surrounded by Dimension Tables 1 to 5, with Sub Dimension Tables 1 to 3 branching off the dimension tables.]

iii.) Fact Constellation: In this schema there is more than one fact table, and the fact tables share dimension tables, which may themselves be organized as in a star or a snowflake schema. It can be viewed as a collection of multiple stars, so it is also called a galaxy schema.

[Figure: Fact constellation. Two fact tables (Fact Table 1 and Fact Table 2, each listing all dimensions and measures) with Dimension Tables 1 to 5 and Sub Dimension Tables 1 to 3 arranged around them.]
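As an illustration of the star schema described in this answer, the sketch below builds one fact table and three dimension tables in an in-memory SQLite database. All table names, column names, and rows are hypothetical and chosen only for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: one row per member of each dimension.
    CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
    CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT, category TEXT);
    CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);

    -- Fact table: one key per dimension plus the numeric measure.
    CREATE TABLE fact_sales (
        time_key INTEGER REFERENCES dim_time (time_key),
        item_key INTEGER REFERENCES dim_item (item_key),
        location_key INTEGER REFERENCES dim_location (location_key),
        dollars_sold INTEGER
    );

    INSERT INTO dim_time VALUES (1, 'Q1', 2009), (2, 'Q2', 2009);
    INSERT INTO dim_item VALUES (1, 'TV', 'Home Electronics'), (2, 'Oven', 'Home Appliance');
    INSERT INTO dim_location VALUES (1, 'Toronto', 'Canada'), (2, 'Chicago', 'USA');
    INSERT INTO fact_sales VALUES (1, 1, 1, 400), (1, 2, 1, 600), (2, 1, 2, 300);
    """
)

# Join the fact table to one dimension table and aggregate the measure.
sales_by_country = conn.execute(
    """
    SELECT l.country, SUM(f.dollars_sold)
    FROM fact_sales AS f
    JOIN dim_location AS l ON f.location_key = l.location_key
    GROUP BY l.country
    ORDER BY l.country
    """
).fetchall()
print(sales_by_country)
```

In a snowflake variant, `dim_location` would be split further, for example into a city table that references a separate country table.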

Q.9. Explain concept hierarchy.

Ans. A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts. For example, consider a dimension Location with the attributes (Street, City, Province_or_State, Country) and another dimension Time with the attributes (Day, Week, Month, Quarter, Year). The attributes of the Location dimension are related by a total order, forming the concept hierarchy Street < City < Province_or_State < Country. The attributes of the Time dimension are related by a partial order, Day < {Month < Quarter; Week} < Year.
[Figure: Concept hierarchy for the Location and Time dimensions. Location: Street, City, Province_or_State, Country (bottom to top). Time: Day at the bottom, then Week on one branch and Month, Quarter on the other, with Year at the top.]

Since a Week often crosses the boundary between two consecutive Months, it is usually not treated as a lower abstraction of Month. Instead, it is treated as a lower abstraction of Year, since a Year contains approximately 52 Weeks. A concept hierarchy may also be defined by discretizing or grouping values for a given dimension or attribute, resulting in a set-grouping hierarchy. For example, the range ($0 - $1000] of the dimension Price can be divided into ($0 - $200], ($200 - $400], . . . , ($800 - $1000], and each of these ranges can then be divided further.
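Both kinds of hierarchy can be sketched in a few lines. The code below is a minimal illustration: the level list follows the Location order given above, and `price_range` implements the set-grouping of Price into intervals of width $200. The function names and the choice of width as a parameter are assumptions made for the example.

```python
import math

# Schema-level hierarchy for Location, lowest level first.
LOCATION_LEVELS = ["Street", "City", "Province_or_State", "Country"]


def is_lower_level(a, b):
    """True if level a is strictly below level b in the Location total order."""
    return LOCATION_LEVELS.index(a) < LOCATION_LEVELS.index(b)


def price_range(price, width=200, upper=1000):
    """Map a price in ($0, $1000] to its half-open interval (low, high]."""
    if not 0 < price <= upper:
        raise ValueError("price must lie in ($0, $1000]")
    high = math.ceil(price / width) * width
    return (high - width, high)


print(is_lower_level("City", "Country"))
print(price_range(200), price_range(201), price_range(1000))
```

Because the intervals are open on the left and closed on the right, a price of exactly $200 falls into ($0 - $200] rather than ($200 - $400].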

Q.10. What are measures in a data cube? Explain the types of measures.

Ans. In a multidimensional model, each point (cell) is identified by a combination of dimension values. For example, with three dimensions and the values Time = Q3, Item = Pepsi and Location = Mumbai, the value 300,000 may represent the number of units of Pepsi sold in Mumbai in Quarter 3. This value is computed by a function called a measure. Measures are divided into three categories according to the kind of aggregate function used.
1. Distributive: An aggregate function is distributive if it can be computed in a distributed manner: the data are partitioned into sets, the function is applied to each partition, and combining the partial results gives the same answer as applying the function to all the data at once. A measure is distributive if it is obtained by applying a distributive aggregate function, and such measures can be computed efficiently. SUM(), COUNT(), MAX() and MIN() are distributive functions.
2. Algebraic: An aggregate function is algebraic if it can be computed by an algebraic function with M arguments (where M is a bounded positive integer), each of which is obtained by applying a distributive aggregate function. For example, AVG() = SUM()/COUNT(), where SUM() and COUNT() are both distributive aggregate functions. The same holds for STDEV(), CORRELATION(), etc.
3. Holistic: An aggregate function is holistic if there is no constant bound on the storage size needed to describe a sub-aggregate. That is, there does not exist an algebraic function with M arguments (where M is a constant) that characterizes the computation. Common examples of holistic functions include MEDIAN(), MODE(), and RANK(). A measure is holistic if it is obtained by applying a holistic aggregate function.

Q.11. Explain each of the types of OLAP operations.

Ans. In a multidimensional model, data are organized into multiple dimensions, and each dimension contains multiple levels of abstraction (e.g., for the Time dimension the levels of abstraction are Day < Month < Quarter < Year). This organization gives users the flexibility to view data from different perspectives. A number of OLAP data cube operations exist to materialize these different views.

1. Roll-up (Drill-up): The roll-up operation performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction. Consider the Location hierarchy, defined as the total order Street < City < Province_or_State < Country. The roll-up operation shown aggregates the data by ascending the Location hierarchy from the level of City to the level of Country. In other words, rather than grouping the data by City, the resulting cube groups the data by Country.

Thus, in the figure above, Q1-Canada-Home_Ent. = Q1-Vancouver-Home_Ent. + Q1-Toronto-Home_Ent. = 605 + 395 = 1000, and similarly for USA.

2. Drill-down: Drill-down is the reverse of roll-up. It can be realized either by stepping down a concept hierarchy for a dimension or by introducing additional dimensions. In the example, a drill-down operation is performed on the central cube by stepping down the concept hierarchy for Time, defined as Day < Month < Quarter < Year. The drill-down descends the Time hierarchy from the level of Quarter to the more detailed level of Month. The resulting data cube details the total sales per month rather than summarizing them by quarter. A drill-down on the central data cube can also be performed by introducing an additional dimension, such as Customer_Group.

So, in the figure above, Q1-Vancouver-Security (400) has been divided into the three months January (150), February (100) and March (150).

3. Slice and Dice: The slice operation performs a selection on one dimension of the given cube, resulting in a subcube. Selecting from the central cube with the criterion Time = Q1 is known as slicing the cube on the Time dimension with the value Quarter 1. The dice operation defines a subcube by performing a selection on two or more dimensions. For example, a dice operation on the central cube can be based on the following selection criteria, which involve three dimensions: (Location = Toronto or Vancouver) and (Time = Q1 or Q2) and (Item = home entertainment or computer).

4. Pivot (Rotate): Rotating the data axes to provide an alternative presentation of the data is called pivoting. The figure shows a pivot operation in which the Item and Location axes of a 2-D slice are rotated.
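The four operations above can be sketched on a tiny cube stored as a dictionary keyed by (quarter, city, item). The first three cell values are the ones quoted in the text; the two Q2 cells and the city-to-country mapping for Chicago are hypothetical additions so that dice has something to filter out.

```python
from collections import defaultdict

cube = {
    ("Q1", "Vancouver", "home entertainment"): 605,
    ("Q1", "Toronto", "home entertainment"): 395,
    ("Q1", "Vancouver", "security"): 400,
    ("Q2", "Chicago", "computer"): 100,  # hypothetical value
    ("Q2", "Toronto", "computer"): 50,   # hypothetical value
}
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "USA"}


def roll_up(cells, city_to_country):
    """Climb the Location hierarchy from City to Country, summing the measure."""
    result = defaultdict(int)
    for (quarter, city, item), value in cells.items():
        result[(quarter, city_to_country[city], item)] += value
    return dict(result)


def slice_cube(cells, quarter):
    """Select one value on the Time dimension, leaving a 2-D (city, item) subcube."""
    return {(c, i): v for (q, c, i), v in cells.items() if q == quarter}


def dice(cells, quarters, cities, items):
    """Select on all three dimensions at once."""
    return {k: v for k, v in cells.items()
            if k[0] in quarters and k[1] in cities and k[2] in items}


def pivot(slice_2d):
    """Swap the two axes of a 2-D slice."""
    return {(item, city): v for (city, item), v in slice_2d.items()}


by_country = roll_up(cube, CITY_TO_COUNTRY)
q1 = slice_cube(cube, "Q1")
diced = dice(cube, {"Q1", "Q2"}, {"Toronto", "Vancouver"},
             {"home entertainment", "computer"})
print(by_country[("Q1", "Canada", "home entertainment")])
```

Drill-down is not shown because it needs the more detailed data (for example the monthly values) to be available; it cannot be derived from the summarized cells alone.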

For other OLAP operations, such as drill-across, drill-through, ranking the top-N or bottom-N items in a list, and computing moving averages or growth rates, see any data warehousing book.

Q.12. Explain the starnet query model.

Ans. The querying of multidimensional databases can be based on a starnet model. A starnet model consists of radial lines emanating from a central point, where each line represents a concept hierarchy for a dimension. Each abstraction level in the hierarchy is called a footprint. These represent the granularities available for use by OLAP operations such as drill-down and roll-up.

Here, each straight line represents a radial line and each circle on a radial line represents an abstraction level. Hence, the Customer line is a radial line, and Name, Category and Group are its levels of abstraction (footprints).

Q.13. What are the different approaches to the data warehouse design process?

Ans. Building and using a data warehouse is a complex task because it requires business skills, technology skills, and program management skills. Regarding business skills, building a data warehouse involves understanding how such systems store and manage their data, how to build extractors that transfer data from the operational system to the data warehouse, and how to build warehouse refresh software that keeps the data warehouse reasonably up to date with the operational system's data. Using a data warehouse involves understanding the significance of the data it contains, as well as understanding and translating the business requirements into queries that can be satisfied by the data warehouse. Regarding technology skills, data analysts are required to understand how to make assessments from quantitative information and derive facts based on conclusions from historical information in the data warehouse. These skills include the ability to discover patterns and trends, to extrapolate trends based on history and look for anomalies or paradigm shifts, and to present coherent managerial recommendations based on such analysis. Finally, program management skills involve the need to interface with many technologies, vendors, and end users in order to deliver results in a timely and cost-effective manner. To design an effective data warehouse we need to understand and analyze business needs and construct a business analysis framework. Three design approaches can be considered for the construction of a large and complex data warehouse system:
i. Top-down approach: This starts with overall design and planning. It is useful in cases where the technology is mature and well known, and where the business problems that must be solved are clear and well understood.
ii. Bottom-up approach: This starts with experiments and prototypes. It is useful in the early stage of business modeling and technology development. It allows an organization to move forward at considerably less expense and to evaluate the benefits of the technology before making significant commitments.
iii. Combined approach: An organization can exploit the planned and strategic nature of the top-down approach while retaining the rapid implementation and opportunistic application of the bottom-up approach.

Q.14. How many different types of OLAP are there? Explain each of them.

Ans. There are three types of OLAP architectures.
i.) ROLAP (Relational OLAP): In this approach the data are stored in a relational database. ROLAP servers are intermediate servers that sit between a relational back-end server and client front-end tools. ROLAP can handle large amounts of data and generates reports using SQL queries, so its performance can be slow if the underlying data size is large.
ii.) MOLAP (Multidimensional OLAP): In MOLAP, the data are stored in a multidimensional cube rather than in a relational database. It cannot handle very large amounts of data, and with multidimensional data stores the storage utilization may be low if the data set is sparse. It also requires additional investment: cube technology is often proprietary and may not already exist in the organization.
iii.) HOLAP (Hybrid OLAP): The hybrid OLAP approach combines ROLAP and MOLAP technology, benefiting from the greater scalability of ROLAP and the faster computation of MOLAP. For example, a HOLAP server may allow large volumes of detail data to be stored in a relational database, while aggregations are kept in a separate MOLAP store.

Q.15. Explain briefly the cube computation and materialization process.

Ans. OLAP may need to access different cuboids for different queries. Therefore, it is a good idea to compute some of the cuboids in advance. A major challenge related to this precomputation is the required storage space. The storage requirements are even more excessive when many of the dimensions have many abstraction levels. This problem is called the curse of dimensionality. If there are no hierarchies associated with the dimensions, then the total number of cuboids for an n-dimensional data cube is 2^n. When concept hierarchies are present for the dimensions, the total number of cuboids that can be generated for an n-dimensional data cube is

Total number of cuboids = $\prod_{i=1}^{n} (L_i + 1)$

where $L_i$ is the number of levels associated with dimension i, and one is added to $L_i$ to include the virtual top level "all".

Data cube materialization is therefore central to cube computation. Given a base cuboid, there are three choices for data cube materialization:
i.) No materialization: Do not precompute any of the non-base cuboids. This leads to computing expensive multidimensional aggregates on the fly, which can be extremely slow.
ii.) Full materialization: Precompute all of the cuboids. The resulting lattice of computed cuboids is known as the full cube. This requires a huge amount of storage space.
iii.) Partial materialization: Compute a proper subset of the cube, which contains only those cells that satisfy some user-specified criterion, such as the tuple count of each cell being above some threshold. Such a subset of the full cube, containing only some of the computed cells, may be called a subcube.
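The cuboid counts can be checked numerically. The sketch below assumes the usual counting formula, the product of (L_i + 1) over all dimensions, where L_i is the number of hierarchy levels of dimension i excluding the virtual top level "all". The function name and the sample level counts are illustrative.

```python
from math import prod


def total_cuboids(levels_per_dimension):
    """Number of cuboids when dimension i has levels_per_dimension[i] hierarchy levels.

    Each dimension contributes (L_i + 1) choices: one per level, plus the
    virtual top level "all".
    """
    return prod(levels + 1 for levels in levels_per_dimension)


# With no hierarchies every dimension has exactly one level, giving 2^n cuboids.
print(total_cuboids([1, 1, 1]))

# Hypothetical example: Location with 4 levels and Time with 4 levels.
print(total_cuboids([4, 4]))
```

The growth is multiplicative, which is why adding levels to several dimensions quickly makes full materialization impractical.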

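Q.16. What are ancestor and descendant cells?

Ans. In an n-dimensional data cube, an i-D cell $a = (a_1, a_2, \ldots, a_n, \text{measures}_a)$ is an ancestor of a j-D cell $b = (b_1, b_2, \ldots, b_n, \text{measures}_b)$, and $b$ is a descendant of $a$, if and only if (1) $i < j$, and (2) for $1 \le m \le n$, $a_m = b_m$ whenever $a_m \ne *$. In particular, cell $a$ is called a parent of cell $b$, and $b$ is a child of $a$, if and only if $j = i + 1$. For example, a 1-D cell and a 2-D cell are ancestors of a 3-D cell when each of them agrees with the 3-D cell on every dimension in which its own value is not *.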
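The ancestor and parent relations can be written as two small predicates. The sketch assumes the standard definition: an i-D cell a is an ancestor of a j-D cell b when i < j and a agrees with b on every dimension where a is not "*". Cells are represented as tuples of dimension values with the measures omitted, and the sample cells are hypothetical.

```python
STAR = "*"


def dimensionality(cell):
    """Number of dimensions of the cell that are not aggregated ("*")."""
    return sum(1 for value in cell if value != STAR)


def is_ancestor(a, b):
    """True if cell a is an ancestor of cell b (equivalently, b is a descendant of a)."""
    if len(a) != len(b):
        raise ValueError("cells must come from the same n-dimensional cube")
    agrees = all(x == y for x, y in zip(a, b) if x != STAR)
    return dimensionality(a) < dimensionality(b) and agrees


def is_parent(a, b):
    """True if a is an ancestor of b exactly one level up."""
    return is_ancestor(a, b) and dimensionality(b) == dimensionality(a) + 1


# Hypothetical cells over the dimensions (month, city, customer_group).
cell_1d = ("Jan", STAR, STAR)
cell_2d = ("Jan", STAR, "Business")
cell_3d = ("Jan", "Toronto", "Business")

print(is_ancestor(cell_1d, cell_3d), is_parent(cell_2d, cell_3d))
```

Here the 1-D and 2-D cells are both ancestors of the 3-D cell, but only the 2-D cell is its parent.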

Q.17. Explain each of the following terms: Full Cube, Iceberg Cube, Closed Cube, and Cube Shell.

Ans. Full cube: A full cube is obtained by computing all the cells of all the cuboids of a given cube in advance. An n-dimensional data cube contains 2^n cuboids, and even more if concept hierarchies are considered, so precomputing the full cube can require a huge amount of storage.
Iceberg cube: Here only some of the cells of the cuboids are computed (partial materialization), namely those that satisfy a minimum support threshold. A data cube consisting of only those cells is known as an iceberg cube. For example, compute only those cells for which count ≥ 10 or sales ≥ $1000.
Closed cube: Even under an iceberg condition such as a minimum support of 10, the computation may still include many cells that carry no additional information, because a more specific cell has exactly the same measure value. A cell c is a closed cell if there exists no descendant cell d of c that has the same measure value as c. A data cube consisting of only closed cells is called a closed cube.
Cube shell: Another strategy for partial materialization is to precompute only the cuboids involving a small number of dimensions, such as three to five, in an n-dimensional cube. Those cuboids form a cube shell of the corresponding size.

For some of the cells in a cuboid the measure value is zero, or no information useful for our analysis can be deduced from them. Cuboids dominated by such cells are called sparse cuboids. If a cube contains many sparse cuboids, we say that the cube is sparse.
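The iceberg condition can be illustrated with a naive computation: count every cell of every cuboid, then keep only the cells whose count reaches the minimum support. This is a sketch for small inputs only, not one of the efficient iceberg cube algorithms, and the rows and threshold are hypothetical.

```python
from collections import Counter
from itertools import product


def iceberg_cells(rows, min_count):
    """Return {cell: count} for all cells of all cuboids with count >= min_count."""
    counts = Counter()
    for row in rows:
        # A row contributes to every cell obtained by replacing
        # any subset of its values with "*".
        for mask in product((False, True), repeat=len(row)):
            cell = tuple("*" if starred else value for value, starred in zip(row, mask))
            counts[cell] += 1
    return {cell: n for cell, n in counts.items() if n >= min_count}


# Hypothetical (item, store location) tuples.
rows = [("TV", "Boston"), ("TV", "Boston"), ("TV", "Georgia"), ("DVD", "Georgia")]
result = iceberg_cells(rows, min_count=2)
for cell, n in sorted(result.items()):
    print(cell, n)
```

Of the eight distinct cells in the full cube for these rows, only five pass the threshold; cells such as ("DVD", "*") with a count of 1 are dropped.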

Q.18. Obtain a data cube with the following data set.

| Item ID | Item Name | Item Category | Item Price | Customer Name | Customer_Street | Customer_City | Store ID | Store Location | Quantity | Total | Date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 101 | TV | Home Electronics | $200 | Adams | Spring | Pittsfield | W001 | Georgia | 5 | $1,000 | 5-Oct-09 |
| 102 | DVD | Home Electronics | $100 | Brooks | Senator | Brooklyn | E002 | Los Angeles | 7* | $700 | 9-Oct-09 |
| 201 | Oven | Home Appliance | $500 | Curry | North | Rye | N013 | New York | 10 | $5,000 | 7-May-09 |
| 301 | Bread | Grocery | $2 | Glenn | Sand Hill | Woodside | S003 | Boston | 1000 | $2,000 | 21-Sep-09 |
| 301 | Bread | Grocery | $2 | Green | Walnut | Stamford | W007 | Los Angeles | 4000 | $8,000 | 5-Oct-09 |
| 214 | Refrigerator | Home Appliance | $240 | Hayes | Main | Harrison | N110 | New York | 10 | $2,400 | 21-Nov-09 |
| 201 | Oven | Home Appliance | $500 | Smith | Alma | Palo Alto | E004 | California | 6* | $3,000 | 11-Jun-09 |
| 105 | Computer | Home Electronics | $700 | Adams | Spring | Pittsfield | W001 | Georgia | 14 | $9,800 | 8-Feb-09 |
| 102 | DVD | Home Electronics | $100 | Lindsay | Park | Pittsfield | W005 | Georgia | 5* | $500 | 8-Feb-09 |
| 301 | Bread | Grocery | $2 | Smith | North | Rye | N011 | New York | 800 | $1,600 | 5-Oct-09 |
| 201 | Oven | Home Appliance | $500 | Turner | Putnam | Stamford | W006 | Chicago | 9* | $4,500 | 19-Apr-09 |
| 105 | Computer | Home Electronics | $700 | Williams | Nassau | Princeton | S103 | Princeton | 10 | $7,000 | 25-Dec-09 |
| 101 | TV | Home Electronics | $200 | Glenn | Sand Hill | Woodside | S003 | Boston | 8* | $1,600 | 13-Jun-09 |
| 106 | Pendrive | Home Electronics | $20 | Glenn | Sand Hill | Woodside | S003 | Boston | 50 | $1,000 | 21-Nov-09 |
| 214 | Refrigerator | Home Appliance | $240 | Brooks | Senator | Brooklyn | E002 | Los Angeles | 7* | $1,680 | 20-Sep-09 |
| 201 | Oven | Home Appliance | $500 | Tom | Main | Harrison | N110 | New York | 7* | $3,500 | 2-Aug-09 |

* Quantity is not legible in the source for this row; the value shown is derived as Total / Item Price.

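Ans. Since this is a 4-D cube, we can draw a 3-D cube for a fixed customer name, here Customer Name = Glenn.

[Figure: 3-D data cube of Store, Item and Time for Customer Name = Glenn. Store axis: North, South, East, West. Time axis: Q1 to Q4. Item axis: TV, DVD, Oven, Bread, Computer, Refrigerator, Pendrive. One cell is marked 1 in each of Q2, Q3 and Q4.]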

1. Roll-up operation for each quarter.
   Q1: Total sales = $10,300.
   Q2: Total sales = $14,100.
   Q3: Total sales = $7,180.
   Q4: Total sales = $21,700.

2. Perform the drill-down operation on Item Name. The figures for Q1, Q2, Q3 and Q4 give the total sales for each item name.

3. Perform a slice on item name and quarter.

4. Perform a dice operation on quarter (Q1 and Q2), item name (TV and DVD) and store (West and South).

5. Perform a pivot operation on the Item and Store dimensions.
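The quarterly roll-up totals in step 1 can be reproduced from the Date and Total columns of the Q.18 data set. The pairing of dates with totals below follows the row order of the data set as read from the source, which is an assumption about how the extracted columns line up; the helper names are illustrative.

```python
from collections import defaultdict

# (date, total in dollars) pairs from the Q.18 data set, in row order.
sales = [
    ("5-Oct-09", 1000), ("9-Oct-09", 700), ("7-May-09", 5000), ("21-Sep-09", 2000),
    ("5-Oct-09", 8000), ("21-Nov-09", 2400), ("11-Jun-09", 3000), ("8-Feb-09", 9800),
    ("8-Feb-09", 500), ("5-Oct-09", 1600), ("19-Apr-09", 4500), ("25-Dec-09", 7000),
    ("13-Jun-09", 1600), ("21-Nov-09", 1000), ("20-Sep-09", 1680), ("2-Aug-09", 3500),
]

MONTH_TO_QUARTER = {
    "Jan": "Q1", "Feb": "Q1", "Mar": "Q1",
    "Apr": "Q2", "May": "Q2", "Jun": "Q2",
    "Jul": "Q3", "Aug": "Q3", "Sep": "Q3",
    "Oct": "Q4", "Nov": "Q4", "Dec": "Q4",
}


def quarter_of(date_text):
    """Climb the Time hierarchy from Day to Quarter for a date such as '5-Oct-09'."""
    _day, month, _year = date_text.split("-")
    return MONTH_TO_QUARTER[month]


def roll_up_by_quarter(rows):
    """Sum the totals per quarter (roll-up over all other dimensions)."""
    totals = defaultdict(int)
    for date_text, total in rows:
        totals[quarter_of(date_text)] += total
    return dict(totals)


quarter_totals = roll_up_by_quarter(sales)
for quarter in sorted(quarter_totals):
    print(quarter, quarter_totals[quarter])
```

The computed totals agree with the roll-up figures listed in step 1.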

SQL Operations: We start with an example of a 4-D data cube with the following dimensions:
Location = (Street, City, State, Country, Continent)
Item = (Name, Brand, Category, No_sold, Dollar_sales)
Time = (Day, Month, Quarter, Year)
Customer = (Cname, Sex)

Now, let's start with some simple SQL queries:

1. Display the names of all customers.
   select cname
   from customer;

2. Display all customer information.
   select *
   from customer;

3. List all the items with their dollar sales.
   select itm.name, sum(itm.dollar_sales)
   from item as itm
   group by itm.name;

4. List all the items whose total dollar sales are at least $4,000.
   select itm.name
   from item as itm
   group by itm.name
   having sum(itm.dollar_sales) >= 4000;

The HAVING condition in the last query is sometimes called an iceberg condition.
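The last query can be tried out against an in-memory SQLite database. The sketch below assumes an `item` table with the columns listed for the Item dimension; the sample rows and brand names are hypothetical, and the query groups by item name and keeps groups whose total dollar sales are at least 4000.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE item (name TEXT, brand TEXT, category TEXT, "
    "no_sold INTEGER, dollar_sales INTEGER)"
)
conn.executemany(
    "INSERT INTO item VALUES (?, ?, ?, ?, ?)",
    [
        ("TV", "BrandA", "Home Electronics", 5, 1000),
        ("TV", "BrandA", "Home Electronics", 8, 1600),
        ("Oven", "BrandB", "Home Appliance", 10, 5000),
        ("Bread", "BrandC", "Grocery", 1000, 2000),
    ],
)

# Iceberg condition: keep only groups whose aggregate passes the threshold.
iceberg = conn.execute(
    """
    SELECT itm.name, SUM(itm.dollar_sales) AS total_sales
    FROM item AS itm
    GROUP BY itm.name
    HAVING SUM(itm.dollar_sales) >= 4000
    ORDER BY itm.name
    """
).fetchall()
print(iceberg)
```

Only Oven passes the threshold here: TV totals 2600 across its two rows and Bread totals 2000, so both groups are filtered out by the HAVING clause.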