Data warehousing is a concept: a set of hardware and software components that can be used
to better analyze the massive amounts of data that companies are accumulating, in order to make
better business decisions. Data warehousing is not just the data in the data warehouse, but also the
architecture and tools to collect, query, analyze and present information.
Operational data is the data you use to run your business. This data is what is typically stored,
retrieved, and updated by your Online Transactional Processing (OLTP) system. An OLTP
system may be, for example, a reservations system, an accounting application, or an order entry
application.
Informational data is created from the wealth of operational data that exists in your business and
some external data useful to analyze your business. Informational data is what makes up a data
warehouse. Informational data is typically:
• Summarized operational data
• De-normalized and replicated data
• Infrequently updated from the operational systems
• Optimized for decision support applications
• Possibly "read only" (no updates allowed)
• Stored on separate systems to lessen impact on operational systems
Metadata/Information Catalogue:
Metadata describes the data that is contained in the data warehouse (e.g. data elements and
their business-oriented descriptions) as well as the source of that data and the transformations or
derivations that may have been performed to create each data element.
Data warehousing can be a key differentiator in many different industries. At present, some of
the most popular Data warehouse applications include:
• sales and marketing analysis across all industries
• inventory turn and product tracking in manufacturing
• category management, vendor analysis, and marketing program effectiveness analysis
in retail
• profitable lane or driver risk analysis in transportation
Data warehousing has quickly evolved into a unique and popular business application class.
Early builders of data warehouses already consider their systems to be key components of their
IT strategy and architecture.
In the 1970s virtually all business system development was done on IBM mainframe
computers using tools such as COBOL, CICS, IMS, DB2, etc. The 1980s brought new
mini-computer platforms such as the AS/400 and VAX/VMS. The late eighties and early nineties
made UNIX a popular server platform with the introduction of the client/server architecture. By
some estimates, more than 70 percent of business data for large corporations still resides in the
mainframe environment.
In recent times, advanced users frequently use desktop database programs that allow them to
store and work with the information extracted from the legacy sources. Many desktop reporting
and analysis tools are increasingly targeted towards end users and have gained considerable
popularity on the desktop.
Another downside is the difficulty of sharing analyses with others. For example, during budgeting
one user (say, the boss) may create analysis models (say, allocation rules) that are to be used by
all the others; the first user then generates the final output by putting these analyses together.
Furthermore, the semantics of the data may need to be standardized before it is released to the
users. In a desktop environment, this may be nearly impossible.
As the data is stored on disparate systems, it is very difficult to ensure that updates to the data
are communicated to all users. Say sales data comes in, and one person sends brand-wise
summaries to some key users, who then forward them to their subordinates. Some hours later it
is realized that data from one of the warehouses was missed out, and revised reports are sent.
Result: different people working on different versions of the same data, and unnecessary
reconciliation issues crop up later.
The last category of analysis systems has been decision support systems and executive
information systems. Decision support systems tend to focus more on detail and are targeted
towards lower to mid-level managers. Executive information systems have generally provided a
higher level of consolidation and a multi-dimensional view of the data, as high level executives
need more the ability to slice and dice the same data than to drill down to review the data detail.
This category comes closest to data warehousing applications, with the following characteristic:
• These systems hold data under descriptive, standard business terms rather than cryptic
computer field names; non-technical users design the data names and data structures in these
systems for use.
Data warehousing applications are prominent today because the key technologies are
available, hardware prices are down, good server software exists, internet applications
are widely available and, most importantly, lots of tools are available too.
Consider a large grocery chain, which has around 500 large stores in 3 states and many
departments. Each store deals with around 60,000 products: 40,000 products are bought from
external vendors, and the remaining 20,000 are prepared by the different departments.
After answering the above questions, attempt the following conceptual questions.
2. Define measure.
3. Find the difference between OLTP and OLAP. Supply two SQL queries to justify both systems
(example queries are sketched after these questions).
4. Justify the importance of Time with respect to both OLTP and OLAP.
sale quantity and selling price. Imagine the kinds of temporal analyses you would like to do with
your sales data. What attributes would you like the Time dimension table to have? How large
could this dimension table possibly get?
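To illustrate how the two systems differ, here is a minimal sketch of one query of each kind; the
table and column names (orders, sales_fact, product_dim, time_dim) are assumptions for
illustration, not taken from the case above.

-- OLTP: retrieve and update a single current record (short, keyed transaction)
SELECT order_id, customer_id, order_status, order_total
FROM orders
WHERE order_id = 123456;

UPDATE orders SET order_status = 'SHIPPED' WHERE order_id = 123456;

-- OLAP: scan and aggregate large volumes of historical data across dimensions
SELECT p.category, t.year, t.quarter,
       SUM(f.sale_quantity * f.selling_price) AS revenue
FROM sales_fact f
JOIN product_dim p ON p.product_key = f.product_key
JOIN time_dim t ON t.time_key = f.time_key
GROUP BY p.category, t.year, t.quarter
ORDER BY p.category, t.year, t.quarter;

Note how the OLTP queries touch one row by key, while the OLAP query scans and summarizes
history across the Time dimension.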
Browsing is the activity of exploring a single-dimension table, prior to firing the template query
above, with a two-fold purpose:
(i) choosing attributes for the select clause of the query; this might be done by simply
dragging the attribute name onto a graphic representing the template query or report.
(ii) choosing application constraints for selecting a subset of rows of the table for the query.
Consider the completely denormalized product table, with each row containing
information about product, product category, and package description. The browsing
activity might result in the application constraint Package_desc = “TetraPak” after a
select package_desc where category = “beverage”.
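As a rough sketch of this two-step browsing sequence (the table and column names, such as
product and sales_fact, are assumptions for illustration):

-- Step 1: browse the denormalized product dimension for candidate constraint values
SELECT DISTINCT package_desc
FROM product
WHERE category = 'beverage';

-- Step 2: fire the template query with the constraint chosen while browsing
SELECT p.brand, SUM(f.sale_quantity) AS units_sold
FROM sales_fact f
JOIN product p ON p.product_key = f.product_key
WHERE p.category = 'beverage'
  AND p.package_desc = 'TetraPak'
GROUP BY p.brand;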
Drilling down is the action of dragging an attribute name (from a dimension table) onto an
existing report.
The size of a dimension table is invariably a tiny fraction of the size of the fact table. Besides, a
data warehouse is updated only once a day, and it is only on a small fraction of these days that
a dimension table is ever updated. The dimension tables are thus left completely unnormalized.
A policy of having completely unnormalized dimension table allows graphical browsing and
automatic SQL generation for all standard user queries of the kind mentioned above.
It also pays to have as many descriptive or qualifying attributes for each dimension as can be
imagined, so that the end-user can set a variety of application constraints. (Consider, for
example, an analyst who wants to know how the sale of paints on Holi (the festival) days differs
from sales on other days.)
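A minimal sketch of such a fully denormalized Time dimension with a rich set of descriptive
attributes; the column names, including the holiday columns, are assumptions for illustration:

-- Sketch of a fully denormalized Time (date) dimension with many descriptive attributes
CREATE TABLE time_dim (
    time_key       INTEGER PRIMARY KEY,   -- surrogate key referenced by the fact table
    calendar_date  DATE,
    day_of_week    VARCHAR(10),
    week_of_year   INTEGER,
    month_name     VARCHAR(10),
    quarter        CHAR(2),
    year           INTEGER,
    fiscal_period  VARCHAR(10),
    is_weekend     CHAR(1),
    is_holiday     CHAR(1),
    holiday_name   VARCHAR(30)            -- e.g. 'Holi', enabling festival-day analyses
);

With such attributes, a constraint like holiday_name = 'Holi' can be set while browsing.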
There is a subtle difference between a dimension and an entity. "Time" is a dimension, but is
associated with a great many entities, as shown in the accompanying figure.
[Figure: Time hierarchies. Day rolls up to Month, Quarter and Year along one path, and to Week,
4-week period and 13-week period along another.]
For convenience, one of the hierarchies (the one most commonly used in queries) is usually
designated the primary dimension, and every other hierarchy a secondary dimension. Each level
in a hierarchy is said to roll up to the next level (though it is apparent that roll-ups are not
always uniquely defined). Dimensional modeling, then, attempts to depict the facts and their
associated dimensions, without explicitly depicting the entities and relationships that make up a
dimensional hierarchy.
Exercise: You want to analyze course attendances as well as course nominations. Develop a star
schema to do this. Would you choose to have one fact table or multiple? If you choose the
former, will you have any new dimensions?
In general, a single fact table is a good idea where multiple types of facts share a subtype-
supertype relationship with the bulk of the attributes being common. For example, transactions
involving bank accounts have different flavors depending on whether it is a savings account or a
checking account that is being operated. This difference in flavor manifests itself as mild
variations in the composition of attributes that make up the transaction fact. With relatively
unrelated fact types on the other hand, the number of common attributes is small, so the
preferred choice is to have a custom fact table for each fact type, and replicate common
attributes in all custom fact tables to avoid joins.
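A rough sketch of the two design choices for the bank-transaction example; every table and
column name here is an assumption for illustration:

-- Choice 1: a single fact table for closely related fact types (subtype/supertype)
CREATE TABLE account_txn_fact (
    time_key       INTEGER,
    account_key    INTEGER,
    branch_key     INTEGER,
    account_type   CHAR(1),        -- 'S' savings / 'C' checking
    txn_amount     DECIMAL(12,2),
    balance_after  DECIMAL(12,2),
    cheque_number  INTEGER         -- meaningful only for checking transactions
);

-- Choice 2: custom fact tables for relatively unrelated fact types,
-- with the common attributes replicated in each to avoid joins
CREATE TABLE savings_txn_fact (
    time_key      INTEGER,
    account_key   INTEGER,
    branch_key    INTEGER,
    txn_amount    DECIMAL(12,2),
    interest_paid DECIMAL(12,2)
);

CREATE TABLE checking_txn_fact (
    time_key      INTEGER,
    account_key   INTEGER,
    branch_key    INTEGER,
    txn_amount    DECIMAL(12,2),
    cheque_number INTEGER
);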
Exercise: What happens to a table that represents a dimension which has subtypes?
It is common to have data points (facts) that are described as an adjective of your base data (e.g.
actual sales and budgeted sales). Rather than anticipate all adjectives during warehouse design,
we can create a partitioning dimension that holds only the adjectives and their descriptions
(each adjective is called a partition), with the fact table row containing a column called just
sales, along with a new foreign key called partition. This makes it easy to add a new type of
fact such as "forecast sales": just insert a row in the partitioning dimension for the new adjective
"forecast", and have the fact table foreign key partition indicate "forecast" for each record that
represents a forecast.
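A minimal sketch of the partitioning-dimension idea; the names partition_dim and sales_fact,
and the column layouts, are assumptions for illustration:

-- Partitioning dimension holding the "adjectives"
CREATE TABLE partition_dim (
    partition_key  INTEGER PRIMARY KEY,
    adjective      VARCHAR(20),     -- 'actual', 'budgeted', 'forecast', ...
    description    VARCHAR(100)
);

-- Fact table with a single generic sales column plus the partition foreign key
CREATE TABLE sales_fact (
    time_key       INTEGER,
    product_key    INTEGER,
    store_key      INTEGER,
    partition_key  INTEGER,         -- indicates actual / budgeted / forecast
    sales          DECIMAL(14,2)
);

-- Adding a new fact type such as forecast sales needs only a new adjective row
INSERT INTO partition_dim (partition_key, adjective, description)
VALUES (3, 'forecast', 'Forecast sales');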
Big dimensions
Denormalization increases redundancy and, consequently, size. Sometimes a fully denormalized
dimension table does become uncomfortably large. When this happens, the dimension may be
normalized or snowflaked in the following manner. The dimension table stores one key for each
level of the dimension's hierarchy. The lowest level key joins the dimension table to the central
fact table. The rest of the keys join the dimension table to the corresponding higher-level tables.
In a snowflake schema, every dimension is normalized in this manner. The word “snowflake”
refers to the shape of the fully normalized schema when represented graphically.
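A minimal sketch of a snowflaked Product dimension; all table and column names are
assumptions for illustration:

-- Lowest-level dimension table; product_key joins to the central fact table
CREATE TABLE product_dim (
    product_key   INTEGER PRIMARY KEY,
    product_name  VARCHAR(50),
    package_desc  VARCHAR(30),
    brand_key     INTEGER            -- joins to the higher-level brand table
);

-- Higher-level tables of the snowflake
CREATE TABLE brand_dim (
    brand_key     INTEGER PRIMARY KEY,
    brand_name    VARCHAR(30),
    category_key  INTEGER
);

CREATE TABLE category_dim (
    category_key  INTEGER PRIMARY KEY,
    category_name VARCHAR(30)
);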
Exercise: Snowflake the Time dimension. How will you handle multiple hierarchies? Do you
think Time is a big dimension?
There is a significant difference of opinion about whether snowflaking should be done at all, even
for big dimensions. Ralph Kimball [Kim] believes it should never be done, while the Stanford
Technology Group (an Informix company) believes it is useful: in a dimension table of
500,000 rows, it is conceivable to save two kilobytes per row through normalization and hence
save a full gigabyte of disk. Kimball compares this saving to the overall size of the typical
warehouse, which is about 50 GB. Performance is also a factor to be considered: without
snowflaking, a query that needs to analyze sales by brand will have to rummage through 500,000
product rows to filter out perhaps a score of brands. Another factor is the complexity of the
structures as perceived by the end-user (a business analyst), and the associated loss of browsing
ability (recall the definition of browsing given earlier). Finally, load programs and overall
maintenance become more difficult to manage as the data model becomes more complex.
A dimension such as Customer may have many qualifying attributes, such as age, sex, and
income_level, which are of interest to the business analyst not as specific values but as a
combination of brackets. Instead of retaining these as individual attributes in the customer
dimension table, we can replace these by a demographics_key that points to a row in a
minidimension table as shown:
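A minimal sketch of such a minidimension; the bracket choices and all names are assumptions
for illustration:

-- Demographics minidimension: one row per combination of brackets actually used
CREATE TABLE demographics_dim (
    demographics_key  INTEGER PRIMARY KEY,
    age_band          VARCHAR(10),    -- e.g. '<10', '10-19', '20-34', ...
    sex               CHAR(1),
    income_band       VARCHAR(15)     -- e.g. 'low', 'middle', 'high'
);

-- Customer dimension refers to the minidimension instead of raw attributes
CREATE TABLE customer_dim (
    customer_key      INTEGER PRIMARY KEY,
    customer_name     VARCHAR(50),
    city              VARCHAR(30),
    demographics_key  INTEGER          -- points to demographics_dim
);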
Such a schema speeds up queries with complex demographic conditions. Also note that not all
combinations need to be stored in the minidimension table—a customer in the age group <10 is
unlikely to be in a high-income category.
Question: Why make demographics_key a part of the fact table?
Dimensions change their characteristics, albeit slowly. When a customer changes her address,
we can either modify the address in the customer’s record in the customer dimension table (and
lose historical information) or insert a new customer dimension record to capture history.
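A rough sketch of the second option (inserting a new dimension record to preserve history); the
surrogate key, identifiers and date column are assumptions for illustration:

-- Keep the old customer row; add a new row with a new surrogate key for the new address
INSERT INTO customer_dim (customer_key, customer_id, customer_name, address, effective_from)
VALUES (20417,            -- new surrogate key
        'C1043',          -- same operational customer identifier
        'A. Sharma',
        'New address ...',
        DATE '2002-04-01');

-- Fact rows loaded after the change reference the new key (20417);
-- older fact rows keep the old key, so history is preserved.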
Exercise: You may be able to limit the number of changes of interest; for example, you might
rarely analyze data that is over a year old, and it is rare for marital status to change more than
three times a year. How could you handle this efficiently?
A demographics_key can also undergo changes over time. A customer may move into a higher
income bracket, requiring a change in demographics_key. It is easy to see that
demographics_key can be treated just like an attribute, albeit a more complex one.
Exercise: Can you use partitioning dimensions to handle adjectives for aggregates? How about
“Average” itself as an adjective?
The following case study should be read and studied along with the solution given. At the end of
it, one should be very clear about how to draw a dimensional model.
The key to creating an efficient MDDB application is thorough analysis of both the data and its
users. After the data elements have been identified for reporting to the end users, the business
entities will fall in distinct groups of variables with similar characteristics or dimensions.
For example, consider a sales organization, which sells articles to different customers through
different suppliers spread at various geographical locations. From the transaction data (base
tables) of the organization, we can design a fact table which contains the denormalized sales
data at a granularity which is required for creating an MDDB. This fact table stores the Units
and the Dollars of the sales volume at a daily level.
The fact table for this case can be outlined as follows:
In this example, the first 4 columns represent the key determinants of the two facts (Products
sold in Units and Dollars).
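A minimal sketch of such a fact table; the four key columns and all names are assumptions for
illustration:

-- Fact table at daily grain: four keys determine the two facts
CREATE TABLE sales_fact (
    date_key     INTEGER,
    product_key  INTEGER,
    customer_key INTEGER,
    supplier_key INTEGER,
    units        DECIMAL(12,2),   -- Products Sold in Units
    dollars      DECIMAL(14,2),   -- Products Sold in Dollars
    PRIMARY KEY (date_key, product_key, customer_key, supplier_key)
);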
In an MDDB model, the fields of the four dimensions must intersect to determine the values of
the facts. To create the dimensions for the MDDB which is to be built from the base table, it is
advisable to have dimension tables for each of the dimensions. These dimension tables should be
used to derive the dimension fields, hierarchies and other attributes, if required.
[Product dimension table not reproduced in full; sample rows: G1 F1 A1, G1 F2 A2]

Measure dimension:

Measure Code   Measure Name                 Precision
Units          Products Sold in Units       Unit
Dollars        Products Sold in Dollars     K
Note that:
• This design complies with the classic STAR schema design of a data warehouse.
• The fact table contains a compound primary key, with one segment for each dimension, and
additional columns of additive, numeric facts.
• Each dimension in the design has a defined hierarchy. The parent-child relationship
(additive/semi-additive/non-additive) could be business driven. Alternatively, the dimension
tables can be designed to have the attribute level indicator of each record.
• Each dimension contains (and is not restricted to) a key segment.
• Deriving a dimension from a table in an MDDB is always advisable for the following primary
reasons:
  - Dimension size can be reduced by selecting only the valid dimension fields from the table,
    thus reducing the size of the MDDB.
  - Modifications to the dimension hierarchies can be handled easily.
  - It facilitates better maintenance of the cube build process.
  - It facilitates standardization of dimensions across different MDDB applications in an
    organization.
  - Descriptions and levels of the dimension fields can be stored in the tables.
• Facts of the data have been clubbed into a Measure dimension to store different attributes of
the facts and to handle any changes in future (for example, Precision is one of the properties of
the facts included here).
Even with the use of automated tools, however, the time and costs required for data conversion
are often significant. Bill Inmon has estimated that 80% of the time required to build a data
warehouse is typically consumed by the conversion process.
To provide the level of performance needed for a data warehouse, an RDBMS should provide
capabilities for parallel processing - Symmetric Multiprocessor (SMP) or Massively Parallel
Processor (MPP) machines, near-linear scalability, data partitioning, and system administration.
The issues here are how and when to separate the operational data from the analysis data; these
have not changed significantly with the evolution of data warehousing systems, except that they
are now considered more formally during the data warehouse building process.
In the analysis and design phase, building a data warehouse is a journey from the existing ER
model. Whereas earlier systems were limited to producing standard reports, advances in
technology mean that today's data warehousing systems support very sophisticated online
analysis, including multi-dimensional analysis.
Data warehousing systems are most successful when data can be combined from more than one
operational system. When the data needs to be brought together from more than one source
application, it is natural that this integration be done at a place independent of the source
applications. The primary reason for combining data from multiple source applications is the
ability to cross-reference data from these applications. Nearly all data in a typical data
warehouse is built around the time dimension.
The data warehouse system can serve not only as an effective platform to merge data from
multiple current applications; it can also integrate multiple versions of the same application. For
example, an organization may have migrated to a new standard business application that
replaces an old mainframe-based, custom-developed legacy application. The data warehouse
system can serve as a very powerful and much needed platform to combine the data from the old
and the new applications. Designed properly, the data warehouse can allow for year-on-year
analysis even though the base operational application has changed.
Operational systems are designed for acceptable performance for pre-defined transactions. For
example, an order processing system might specify the number of active order takers and the
average number of orders for each operational hour. Even the query and reporting transactions
against the operational system are most likely to be predefined with predictable volume.
Even though many of the queries and reports that are run against a data warehouse are
predefined, it is nearly impossible to accurately predict the activity against a data warehouse.
Data is mostly non-volatile. This attribute of the data warehouse has many very important
implications for the kind of data that is brought to the data warehouse and the timing of the data
transfer.
Many data warehousing projects have failed miserably when they attempted to synchronize
volatile data between the operational and data warehousing systems.
In short, the separation of operational data from the analysis data is the most fundamental data-
warehousing concept. Not only is the data stored in a structured manner outside the operational
system, businesses today are allocating considerable resources to build data warehouses at the
same time that the operational applications are deployed.
There are principles of relational database theory that do not fully apply to data warehousing
systems. Even though most data warehouses are deployed on relational database platforms, some
basic relational principles are knowingly modified when developing the logical and physical
model of a data warehouse.
The possibility of synchronized data in the source systems is important: for example, if the
product codes are not standard across the source systems, and product attributes are stored
across systems, it becomes impossible to maintain all the product attributes in the warehouse.
This is one of the most important concerns to be taken care of before initiating a data-warehousing
project. While data scrubbing and cleaning can take care of the past data, these requirements
become essential for handling continuous updates in an efficient manner.
The data warehouse model needs to be extensible and structured such that data from
different applications can be added as a business case is made for that data. A data
warehouse project in most cases cannot include data from all possible applications right from the
start. Many of the successful data warehousing projects have taken an incremental approach to
adding data from the operational systems and aligning it with the existing data.
The data warehouse model aligns with the business structure
A data warehouse logical model aligns with the business structure rather than with the data model
of any particular application. The same logic can be applied to the entities in an entity relationship
diagram, which are used as the starting point for operational systems: entities are defined
narrowly for individual applications, whereas the warehouse needs one consolidated attribute
base. A direct comparison with ER modeling makes this clear, and the concept of an enterprise
data model is useful here.
Some of the reasons for de-normalizing the data warehouse model are the same as they would be
for an operational system, namely performance and simplicity.
Static relationships in historical data
Another reason that de-normalization is an important process in data warehouse modeling is that
the relationship between many attributes does not change in this historical data. Another
important example is the price of a product. The prices in an operational system may change
constantly. Some of these price changes may be carried to the data warehouse with a periodic
snapshot of the product price table. In a data warehousing system you would carry, with each
order, the list price of the product at the time the order was placed, regardless of the selling
price for that order. An operational system maintains dynamic relationships between business
entities, whereas a data warehouse system captures relationships between business entities at a
given time.
The terms and names used in the operational systems are transformed into uniform, standard
business terms by the data warehouse transformation processes. It is important to give a single
physical definition of an attribute. As an attribute is defined physically for the data warehouse, it
is essential to use meaningful data types and lengths. Use the standard data length and data type
for each attribute everywhere it is used. A functional data dictionary can facilitate this
consistent use of physical attributes.
A second important point is to use entity attribute values consistently.
All attributes in the data warehouse need to be consistent in the use of predefined values.
Different source applications invariably use different attribute values to represent the same
meaning. These different values need to be converted into a single, most sensible value as the
data is loaded into the data warehouse. Or, if the data is to be used by the same set of users, one
may need to store the different attributes too, so that users do not see a disconnect between their
operational and decision support systems.
A far more important problem is inconsistent definition and use of the entities themselves, e.g.
some applications may be storing information at the price code detail level (encoded in the
product code), while others may be storing at a planning code level (all price codes, variants,
etc. are clubbed). Moreover, because of user habits, some of the codes used in the planning
system may be outdated, and replaced by new codes in the sales system. Clubbing information
from multiple sources then becomes a big problem.
Some data attributes can easily be defaulted to a reasonable value when the original is missing
or corrupt. Other values can be obtained by referencing other current data. For example, a
missing product attribute such as unit-of-measure on an order entity can be obtained by
accessing the current product database. Some attributes cannot be filled by defaults for missing
values. In fact, it may be dangerous to attempt to assign default for certain types of missing
values. A poor default may corrupt the data and lead to invalid analysis at a later stage. In these
cases, it is safest to leave the missing values as blank. In some cases, it may make sense to pick
a specific value or symbol that indicates a missing value.
For example, analysis would be misleading if data for, say, February is not stored in the data
warehouse. Also, missing data for part of the year prevents any meaningful year-on-year analysis.
It is important to design a good system to log and identify data that is missing from the data
warehouse. When a user runs a query against the data warehouse, it is essential to understand
the population against which the query is run.
Accurate and complete transformations help maintain the integrity of the data warehouse.
The business rules that are applied in generating summary views can be complex. These
business rules may determine exactly what constitutes a sale or they may determine how a sale
is allocated to a sales or channel entity. In addition to applying the business rules while
generating summary views, the data warehousing system may perform complex database
operations such as multi-table joins. Product sales may be computed by joining the Sales,
Invoice, and Product tables. The criteria to join these tables may be complex. While
individuals mining data in the warehouse detail records need to understand all the complexities
of business rules, most users can retrieve effective summary business information without fully
understanding the detail data.
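A rough sketch of such a summary view; the Sales, Invoice and Product layouts, join columns and
the business-rule filter are assumptions for illustration:

-- Summary view: product sales by month, built by joining Sales, Invoice and Product
CREATE VIEW product_sales_summary AS
SELECT p.product_group,
       i.invoice_month,
       SUM(s.quantity)                AS total_units,
       SUM(s.quantity * s.unit_price) AS total_sales
FROM sales s
JOIN invoice i ON i.invoice_id = s.invoice_id
JOIN product p ON p.product_id = s.product_id
WHERE i.invoice_status = 'BOOKED'        -- business rule: what constitutes a sale
GROUP BY p.product_group, i.invoice_month;

In practice such a summary would usually be precomputed and stored rather than evaluated at
query time, which is the source of the performance gains discussed next.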
The single most important reason for building the summary views is the significant performance
gains they facilitate. The summary views in a data warehouse provide multiple views into the
same detail data. These views are predefined dimensions into the detail data. These views
provide an efficient method for the analyst to link with the detail data when necessary.
In most data warehousing projects, there is a need to select a preferred data warehouse access
tool for the most active users. A small number of users generate most of the analysis activity
against the data warehouse. The data warehouse performance can be tuned to the requirements
of the tool appropriate for these active users. This tool can be used for training and
demonstration of the data warehouse. A user can start with a low-level tool that is already
familiar to him or her. After becoming familiar with the data warehouse he or she may be able to
justify the cost and effort involved with using a more complex tool.
The choice of an OLAP tool for a particular environment and application depends on
the key requirements of the analysts, programmers and the end-users of the
application. Before getting into the details of subjecting an OLAP tool to any
evaluation criteria and performing any test, one needs to have the performance
requirements very clear to guide the evaluation process. Some of the focus areas
can be found out by having the following questions answered at the outset:[das/rak]
Usability
How much of OLAP familiarity exists with the users?
Do the users have any quantified performance expectations?
Are expert options, user programming required?
Is GUI a deciding factor?
What is the tolerable online response time?
Critical Resources
What are the critical System resources that need to be optimally
used by the OLAP tool?
What are the resources that are factors for evaluation?
Set of Features
What features will be high priority for the users?
Will the features help the users do the work more productively?
What are the limitations and constraints that should be ruled out?
Are there any problems with the current tool (if one is in use)?
Another possibility is to reduce the number of dimensional tables by creating dimensions whose
instances are actually combinations of two or more dimensions. If, for example, we were
interested only in 200 brand-region combinations, a brand-region dimension table containing
200 rows could replace the individual brand and region tables.
Storage: aggregate explosion. Full aggregation is often dangerous in practice. Consider a sales
data warehouse where there are 10,000 products and 100 stores. On any given day, not all
products are sold in all stores; perhaps only 1% of the possible 10^6 (= 10,000 x 100)
combinations actually occur, i.e. about 10,000 base rows per day. Yet, most products will be sold
somewhere, and each store will sell something, so that both product-wise and store-wise [daily]
aggregates are relevant. The number of such aggregate rows is 10,100 (10,000 product-wise plus
100 store-wise), so the database size will roughly double if these aggregates are precomputed.
Adding a dimension (say customer) will clearly compound the problem. (Exercise: how?) The
term compounded growth factor (CGF) [Pen] indicates the database size with full aggregation as
a multiple of its size with no pre-aggregation, and is usually between 1.5 and 2.5 per dimension.
The solution to this problem is trial and error: use business analysis and query patterns to decide
what aggregates are worth precomputing. Pareto’s law can be expected to hold in this case: 80%
of queries will utilize only 20% of all possible aggregates, so that it is possible to meet
performance requirements adequately by constructing only a fraction of all possible aggregates.
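As a minimal sketch of precomputing one such aggregate (the CREATE TABLE ... AS syntax
and all table and column names are assumptions; the exact syntax varies by RDBMS):

-- Precompute a brand-by-day aggregate so brand-level queries avoid scanning product detail
CREATE TABLE brand_daily_agg AS
SELECT p.brand_key,
       f.time_key,
       SUM(f.sale_quantity)                   AS units,
       SUM(f.sale_quantity * f.selling_price) AS revenue
FROM sales_fact f
JOIN product_dim p ON p.product_key = f.product_key
GROUP BY p.brand_key, f.time_key;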
* Definitions of data are always inconsistent across user types, upstream data systems and
applications; no two users/systems agree on a common definition easily. This will happen in the
RA/Design stage.
* There is very little data ownership in data warehouses, so the sanctity of the data is often
suspect. This comes out as a problem only when an application is developed to show the data to
the users. This will happen in the Testing/Implementation stage.
* High dependency on Up-stream systems. Any delays in making the up-stream
interfaces ready affect the RA/Design/Development cycle. This will happen in
RA/Design/Development.
* In a reporting application, problems that mostly originate from the up-stream systems are
attributed to the application. This gives rise to end-user dissatisfaction. This will happen in
Acceptance/Testing/Implementation.
* Obtaining test data for data validation is a risk if real time data is of very high
confidentiality. This will happen in Development/Testing stage.
* Squeezed development cycle due to the high visibility of the reporting application; the
background process of collecting data from different sources is not visible to the users. This will
happen in the Development/Implementation stage.
* Data-processing time needs to be minimized to ensure availability of most up-to-date
data worldwide at all times. This will happen in Production/Implementation stage.
* The datamart/warehouse support team is located in one country, but it is often expected to
address issues from users located in different time zones. This will happen in the post-
production/maintenance stage. [das]
Because a transaction is related either to policy formulation or to claims processing, we can have
two fact tables, one for policy formulation and one for claims processing.
Question: What about having just one fact table? Conversely, what about having one fact table
for each type of claim (small/large) or each type of policy (domestic/automobile/industrial)?
The policy creation and claims processing fact tables have the following structures (attributes in
italics indicate those which will also appear in monthly policy and claims snapshots).
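A rough sketch of the two fact tables, with every table and column name assumed for illustration:

-- Policy creation fact table (one row per policy transaction)
CREATE TABLE policy_creation_fact (
    time_key          INTEGER,
    insured_party_key INTEGER,
    employee_key      INTEGER,     -- agent / underwriter
    coverage_key      INTEGER,
    covered_item_key  INTEGER,
    policy_number     VARCHAR(20),
    transaction_type  VARCHAR(15), -- create / renew / cancel ...
    premium_amount    DECIMAL(12,2)
);

-- Claims processing fact table (one row per claims transaction)
CREATE TABLE claims_processing_fact (
    time_key          INTEGER,
    insured_party_key INTEGER,
    employee_key      INTEGER,     -- claims examiner
    coverage_key      INTEGER,
    covered_item_key  INTEGER,
    policy_number     VARCHAR(20),
    claim_number      VARCHAR(20),
    claim_amount      DECIMAL(12,2),
    payment_amount    DECIMAL(12,2)
);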
• Should the attributes of a claim (as represented by the codified identifier Claim#) constitute
a dimension or should they be part of the FACTS of the claims processing fact table? Note
that the attribute set of a claim depends on the coverage scheme.
Coverages come in a huge number of flavors, just as do customers (insured parties). The set of
attributes which are specified most often as a combination to browse coverages can be hived off
into a minidimension. Examples of such attributes could be risk_level and market_segment.
The subset of rows in the policy creation fact table that represent policy cancellations may have
no useful FACT attributes, since the purpose of each such row is merely to record the fact of
cancellation of an existing policy. If we fragment the fact table horizontally so as to create a
separate table to store just policy cancellations, we have what is called a factless fact table. Each
row of a factless fact table contains only foreign keys corresponding to the relevant dimensions,
but no fact attributes per se.
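A minimal sketch of such a factless fact table for policy cancellations, with names assumed for
illustration:

-- Factless fact table: only foreign keys; each row records the fact of a cancellation
CREATE TABLE policy_cancellation_fact (
    time_key          INTEGER,
    insured_party_key INTEGER,
    employee_key      INTEGER,
    coverage_key      INTEGER,
    covered_item_key  INTEGER,
    policy_number     VARCHAR(20)
);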
A factless fact table can also be used to store information about coverages and covered items
that did not have any buyers in, say, a given month. In the simplest case this table would have
columns for the foreign keys Coverage#, CoveredItem# and Month. At the end of each month,
a row would be inserted into this table for each Coverage/Covered item combination that did not
attract any buyers (did not figure in any new policy). This is an example of a coverage table (not
to be confused with insurance coverages).
Exercise: Write a SQL to determine the number of coverage/covered item combinations that did
not figure in any new policy in a given month.
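A rough sketch of the coverage table together with a query of the kind the exercise asks for; all
names and the month-key format are assumptions for illustration:

-- Coverage table: coverage/covered-item combinations with no buyers in a given month
CREATE TABLE unsold_coverage_fact (
    coverage_key     INTEGER,
    covered_item_key INTEGER,
    month_key        INTEGER
);

-- Count the combinations that did not figure in any new policy in a given month
SELECT COUNT(*)
FROM unsold_coverage_fact
WHERE month_key = 200406;      -- assumed key format YYYYMM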
Aggregates. Imagine a requirement to report, for each coverage, the total premiums received
and total claim payments made by calendar month. This needs to be further aggregated to report
total premium and claims payments made by month. Note that the requirement to “aggregate
across policies” usually implies a requirement to also aggregate across other dimensions such as
Employee and InsuredParty to get a useful result.
Exercises:
• How would you design the warehouse to handle a query along the lines of “How many of
our customers have chosen money-back policies?” [Hint: this may require some changes to
the operational system too.]
• If each precomputed aggregate is stored in a separate aggregate fact table, what is the
maximum number of such aggregate fact tables required?
Workout: A sales and marketing data warehouse needs to be built for a manufacturer of hospital
health care products. Salesmen are assigned territories, which roll up to districts, regions and
areas. A product rolls up to subgroup, group and family, where a subgroup is defined by the
assembly line that it rolls out from. (A single manufacturing facility may, of course, have more
than one assembly line.) The products are typically bought by hospitals through buying groups
to which the respective hospitals belong, though a hospital may sometimes buy through a direct
contract or even through a buying group in which it is not a member. We need to be able to
analyze these different types of sales. Design the data warehouse.
About this case study…
Client Profile:
The client designs, manufactures and markets personal computers and related
personal computing and communicating solutions for sale primarily to
education, creative, consumer and business customers. It leads the area of
computing with revolutionary products and innovative designs in all aspects.
The financial data being highly sensitive, a two-layer user security scheme was
required to prevent users from accessing data outside their area, as well as
during the freeze period.
Basic Features
1. Multidimensional Conceptual View. Dr Codd believes this to be the central core of OLAP.
2. Intuitive Data Manipulation. Dr Codd prefers data manipulation to be done through direct
actions on cells in the view, without recourse to menus or multiple actions.
3: Accessibility. In this rule, Dr Codd essentially describes OLAP engines as middleware, sitting
between heterogeneous data sources and an OLAP front-end. Most products can achieve this, but
often with more data staging and batching than vendors like to admit.
4: Batch Extraction vs. Interpretive. This rule effectively requires that products offer both their
own staging database for OLAP data as well as offering live access to external data. Today, this
would be regarded as the definition of a hybrid OLAP, which is indeed becoming the most
popular architecture, so Dr Codd has proved to be very perceptive in this area.
5: OLAP Analysis Models. Dr Codd requires that OLAP products should support all four
analysis models that he describes in his white paper (Categorical, Exegetical, Contemplative and
Formulaic). Perhaps Dr Codd was anticipating data mining in this rule?
6: Client/Server Architecture. Dr Codd requires not only that the product should be
client/server but that the server component of an OLAP product should be sufficiently intelligent
that various clients can be attached with minimum effort and programming for integration. This
is a much tougher test than simple client/server, and relatively few products qualify. Perhaps he
was anticipating a widely accepted API standard, which OLE DB for OLAP is expected to
become.
7: Transparency. This test is also a tough but valid one. Full compliance means that a user of,
say, a spreadsheet should be able to get full value from an OLAP engine and not even be aware
of where the data ultimately comes from. Like the previous feature, this is a tough test for
openness.
8: Multi-User Support. Dr Codd recognizes that OLAP applications are not all read-only and
says that, to be regarded as strategic, OLAP tools must provide concurrent access (retrieval and
update), integrity and security.
Special Features
9: Treatment of Non-Normalized Data. This refers to the integration between an OLAP engine
and de-normalized source data. Dr Codd points out that any data updates performed in the
OLAP environment should not be allowed to alter stored de-normalized data in feeder systems.
Nor should changes be allowed in what are regarded as calculated cells within the OLAP database.
10: Storing OLAP Results: Keeping Them Separate from Source Data. This is really an
implementation rather than a product issue. In effect, Dr Codd is endorsing the widely held view
that read-write OLAP applications should not be implemented directly on live transaction data,
and OLAP data changes should be kept distinct from transaction data.
11: Extraction of Missing Values. All missing values are cast in the uniform representation
defined by the Relational Model.
12: Treatment of Missing Values. All missing values are to be ignored by the OLAP analyzer
regardless of their source.
Reporting Features
13: Flexible Reporting. Dr Codd requires that the dimensions can be laid out in any way that
the user requires in reports. We would agree, and most products are capable of this in their
formal report writers. Dr Codd does not explicitly state whether he expects the same flexibility
in the interactive viewers.
14: Uniform Reporting Performance. Dr Codd requires that reporting performance be not
significantly degraded by increasing the number of dimensions or database size. Curiously,
nowhere does he mention that the performance must be fast, merely that it be consistent. There
are differences between products, but the principal factor that affects performance is the degree
to which the calculations are performed in advance and where live calculations are done (client,
multidimensional server engine or RDBMS). This is far more important than database size,
number of dimensions or report complexity.
15: Automatic Adjustment of Physical Level. Dr Codd requires that the OLAP system adjusts
its physical schema automatically to adapt to the type of model, data volumes and sparsity.
Dimension Control
16: Generic Dimensionality. Dr Codd takes the purist view that each dimension must be
equivalent in both its structure and operational capabilities. However, he does allow additional
operational capabilities to be granted to selected dimensions (presumably including time), but he
insists that such additional functions should be grantable to any dimension.
17: Unlimited Dimensions & Aggregation Levels. Technically, no product can possibly
comply with this feature, because there is no such thing as an unlimited entity on a limited
computer. In any case, few applications need more than about eight or ten dimensions, and few
hierarchies have more than about six consolidation levels.
18: Unrestricted Cross-dimensional Operations .Dr Codd asserts, and we agree, that all forms
of calculation must be allowed across all dimensions, not just the 'measures' dimension. In fact,
many products that use only relational storage are weak in this area. Most products with a
multidimensional database are strong. These types of calculations are important if you are doing
complex calculations, not just cross tabulations, and are particularly relevant in applications that
analyse profitability.
MOLAP
MOLAP is a high performance, multidimensional data storage format. With MOLAP, data is
stored on the OLAP server. MOLAP gives the best query performance, because it is specifically
optimized for multidimensional data queries. MOLAP storage is appropriate for small to
medium-sized data sets where copying all of the data to the multidimensional format would not
require significant loading time or utilize large amounts of disk space.
ROLAP
With ROLAP, the data remains in the original relational tables. A separate set of relational tables
is used to store and reference aggregation data. ROLAP is ideal for large databases or legacy data
that is infrequently queried.
HOLAP
HOLAP combines elements from MOLAP and ROLAP. HOLAP keeps the original data in
relational tables but stores aggregations in a multidimensional format. HOLAP provides
connectivity to large data sets in relational tables while taking advantage of the faster
performance of the multidimensional aggregation storage.