DATA WAREHOUSE & OLAP TECHNOLOGY
Introduction
IT Architecture of an Enterprise
The IT architecture of an enterprise depends primarily on three things:
– the business requirements of the enterprise;
– the technology available at the time; and
– the accumulated investments of the enterprise from earlier
technology generations.
The IT Professional’s Challenge
• The IT professional has to ensure that
– all the requirements of the different business users, each of whom
has different and constantly changing needs, are met in a timely manner.
• This is only possible if the IT team makes use of the latest technology
(infrastructure and software):
– review new methodologies, evaluate new tools, and maintain
ties with technology partners.
Business Cycle: Shift in Information Requirements
• Information requirements shift from operational information to
decision information.
• Old computing practices could not meet this shift.
Hard-to-Estimate Decision Requirements
• The IT team cannot estimate the decision requirements of
enterprise decision makers in advance:
– The needs and reports change as the business situation changes.
– Even the decision makers do not know in advance what information
they will need for decision making.
– Decision makers view data at different levels of detail and from
different perspectives.
Operational Data Stores
“The architectural construct where collective integrated operational
data is stored is called Operational Data Store”
W.H. Inmon, C. Imhoff and G. Battas
Operational Data Stores: Challenges
– Locating the appropriate data sources
– Transforming the data sources to satisfy the ODS requirements
– Handling the complexity of near-real-time propagation of changes
Data Warehouse vs. Operational Data Store

               Data Warehouse               Operational Data Store
Purpose        Strategic decision support   Operational monitoring
Similarities   Integrated data              Integrated data
               Subject oriented             Subject oriented
Differences    Static data                  Volatile data
               Historical data              Current data
               Summarised data              More detailed data
What is a Data Warehouse?
Data Warehousing is an architectural construct of information systems
that provides users with current and historical decision support
information that is hard to access or present in traditional operational
data stores.
Characteristics of a Data Warehouse
• A data warehouse can be viewed as an information system
with the following attributes:
– It is a database designed for analytical tasks, using data from
multiple applications/sources.
– It supports a relatively small number of users with relatively
long interactions.
– Its usage is read-intensive.
– Its content is periodically updated.
– It contains current and historical data.
What is a Data Warehouse?
A formal definition of the data warehouse is
offered by W.H. Inmon:

“A data warehouse is a subject-oriented,
integrated, time-variant, non-volatile
collection of data in support of
management decisions”
Purpose of a Data Warehouse
Why an enterprise takes a data warehouse initiative:
• To provide business users with access to data
– This data is enterprise-level data.
– It is easily available (across organisational hurdles and boundaries) via a
secured connection.
• To provide one version of the truth
– The data in the warehouse is made consistent and quality assured before
being served to the business user.
• To record the past accurately
– It holds historical data too, which helps in comparing
performance over time.
– It allows the OLTP systems to focus on current
transactions.
Purpose of a Data Warehouse (contd.)
• To slice and dice through data
– Dynamic reports allow the business user to view the data from
different angles/dimensions at different levels of detail.
• To separate analytical and operational processing
– Operational processing and decision information processing
systems have different architectural requirements; merging
them creates nightmares for the IT team.
• To support the reengineering of decisional processes
DATA WAREHOUSE: CBA/ROI
• Management is usually interested in the Cost Benefit Analysis
(CBA)/Return on Investment (RoI) before going for a data warehouse.
• Benefits
– Improved productivity of analytical staff, as data is readily available
– Business improvements from analysis of the warehouse data
– Redeployment of staff to more productive work
• Costs
– Hardware
– Software
○ Cost incurred in purchasing licensed software
○ Cost incurred in purchasing additional software for
automating extraction, cleansing, loading, retrieval and
presentation of data
– Services
○ System Integrators, Trainers, Consultants
– Internal staff training
• RoI considerations are dependent on many parameters such as
– Current state of the technology in the organisation
– Culture of the organisation in terms of decision making style and
attitude towards technology
– Company’s position in the market vs its competitors
Architecture (I)
Generally, a data warehouse adopts a
three-tier architecture.
• Bottom Tier − The bottom tier of the
architecture is the data warehouse
database server.
– It is usually a relational database
system.
– Back-end tools and utilities feed
data into the bottom tier.
– These back-end tools and utilities
perform the extract, clean, load,
and refresh functions.
Architecture (II)
• Middle Tier − In the middle tier, we have
the OLAP server, which can be
implemented in either of the following
ways:
– By Relational OLAP (ROLAP), which
is an extended relational database
management system. ROLAP
maps the operations on
multidimensional data to standard
relational operations.
– By the Multidimensional OLAP (MOLAP)
model, which directly implements
multidimensional data and
operations.
• Top Tier − This tier is the front-end
client layer. It holds the query
tools, reporting tools, analysis tools
and data mining tools.
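As a rough illustration of how a ROLAP server might map a multidimensional request onto standard relational operations, the sketch below answers "total amount by region for 2017" with plain SQL. The table `sales`, its columns, and the figures are invented for illustration, and Python's built-in `sqlite3` module merely stands in for the warehouse database.

```python
import sqlite3

# In-memory database standing in for the bottom-tier relational store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("North", "Laptop", 2017, 100.0),
        ("North", "Phone", 2017, 50.0),
        ("South", "Laptop", 2017, 70.0),
        ("South", "Phone", 2018, 30.0),
    ],
)

# The multidimensional request becomes a filter (WHERE) plus an aggregation (GROUP BY).
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales WHERE year = 2017 "
    "GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('North', 150.0), ('South', 70.0)]
```

A MOLAP server would instead keep such totals in a multidimensional structure rather than computing them through SQL.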
Data mart (I)
• A data mart contains lightly summarized departmental data and is
customized to suit the needs of the particular department that owns
the data.
– A data mart is a subset of the data warehouse.
• A data mart improves user response time because the
volume of data is reduced.
• It provides easy access to frequently requested data.
• Data marts are simpler to implement than a corporate
data warehouse, and the cost of implementing a data
mart is lower than that of implementing a full data
warehouse.
Data mart (II)
• Compared to a data warehouse, a data mart is agile. If the model
changes, a data mart can be rebuilt more quickly because of its smaller size.
• A data mart is defined by a single Subject Matter Expert (SME), whereas
a data warehouse is defined by interdisciplinary SMEs from a
variety of domains. Hence, a data mart is more open to change
than a data warehouse.
• Data is partitioned, which allows very granular access control
privileges.
• Data can be segmented and stored on different hardware/software
platforms.
Types of Data Mart
There are three main types of data marts:
• Dependent: A dependent data mart draws its data from an existing
central data warehouse.
• Independent: An independent data mart is created without the use of
a central data warehouse, drawing data directly from operational
sources, external sources, or both.
• Hybrid: This type of data mart can take data from data
warehouses or operational systems.
Dependent, Independent &
Hybrid Data marts
Drivers for having a data mart
• The business drivers for having data marts could be
– Extremely urgent user requirements
– Absence of budget for a full data warehouse architecture
– Absence of a sponsor for an enterprise-wide decision support
strategy
– Decentralization of business units
– Attraction of easy-to-use tools and mind-sized projects
Data mart: Challenges
There are primarily two challenges that may arise from data marts:
– Scalability
○ A data mart may start growing along
multiple dimensions.
– Integration
○ Data consistency and manageability
Data warehouse design approaches
• Top-down, proposed by Bill Inmon
– The data warehouse is designed first, and then data marts
are built on top of the data warehouse.
• Bottom-up, proposed by Ralph Kimball
– Data marts are created first to provide reporting and
analytics capability for specific business processes; later,
the enterprise data warehouse is created from
these data marts.
Top-Down Approach
• Data is extracted from the various source systems. The
extracts are loaded and validated in the staging area. Validation
is required to make sure the extracted data is accurate and
correct. ETL tools or a similar approach can be used to extract the
data and push it to the data warehouse.
• Data is extracted from the data warehouse into the staging area
on a regular basis. At this step, various aggregation and
summarization techniques are applied to the extracted data, which
is then loaded back into the data warehouse.
• Once the aggregation and summarization are complete,
the various data marts extract that data and apply some more
transformations to bring the data into the structure defined by the
data marts.
Bottom-Up Approach
• The data flow in the bottom-up approach starts with the extraction
of data from the various source systems into the staging area, where it
is processed and loaded into the data marts that handle
specific business processes.
• After the data marts are refreshed, the current data is once again
extracted into the staging area, where transformations are applied to
shape the data into the data mart structure. The data
extracted from the data marts into the staging area is aggregated,
summarized, and so on, then loaded into the enterprise data
warehouse (EDW) and made available to end users for analysis,
enabling critical business decisions.
OLTP (Online Transaction Processing)
• OLTP is a class of software programs capable of
supporting transaction-oriented applications on the Internet.
– Here, "transaction" is meant in the sense of computer or database
transactions.
• Typically, OLTP systems are used for
– order entry,
– financial transactions,
– customer relationship management (CRM) and
– retail sales.
• OLTP applications have high throughput and are insert- or
update-intensive in database management.
• These applications are used concurrently by hundreds of users.
• The key goals of OLTP applications are availability, speed,
concurrency and recoverability.
• Database queries are usually simple, require sub-second response
times and return relatively few records.
• IBM's CICS (Customer Information Control System) is a well-known
OLTP product.
OLAP (Online Analytical Processing)
• OLAP performs multidimensional analysis of business data and
provides the capability for complex calculations, trend analysis,
and sophisticated data modeling.
• It is the foundation for many kinds of business applications for
Business Performance Management, Planning, Budgeting,
Forecasting, Financial Reporting, Analysis, Simulation Models,
Knowledge Discovery, and Data Warehouse Reporting.
• OLAP enables end-users to perform ad hoc analysis of data in
multiple dimensions, thereby providing the insight and
understanding they need for better decision making.
Types of OLAP (i): ROLAP (I)
• Able to directly access data stored in relational databases.
• The notion is that they can readily retrieve transactional data,
although this becomes suspect when very large data sets are in play,
or if more complex calculations are to be delivered, based on the
transactional data.
• ROLAP products enable organizations to leverage their existing
investments in RDBMS (relational database management system)
software.
• ROLAP products provide GUIs and generate SQL execution plans that
typically remove end-users from the SQL writing process.
• However, this over-reliance on processing via SQL statements—
including processing for multidimensional analysis—is a drawback.
Types of OLAP (i): ROLAP (II)
• There are further drawbacks to structuring a multidimensional
model solely within relational tables: before end-users can submit
requests, the relevant dimension data must be extracted and
reformatted into dimensional structures known as star or
snowflake schemas (so called because of the way the tables are conjoined).
• These tabular structures are necessary to provide acceptable
analytical performance. Sophisticated ROLAP applications also
require that aggregate tables be pre-built and maintained,
eliminating the need to process summary data at runtime.
• One advantage of ROLAP over the other styles of OLAP analytic tools
is that it is deemed to be more scalable in handling huge amounts
of data.
• ROLAP sits on top of relational databases, which enables it to
leverage several functionalities that a relational database is capable
of.
Types of OLAP (ii): MOLAP
• MOLAP products enable end-users to model data in a
multidimensional environment, rather than providing a
multidimensional view of relational data, as ROLAP products do.
• Multidimensional databases allow users to add extra dimensions,
rather than additional tables, as in a relational model.
• The MOLAP cube structure allows for particularly fast, flexible
data modeling and calculations.
• What are the perceived drawbacks of MOLAP tools?
– For one, relevant data must be transferred from relational
systems, which is a potentially “redundant” re-creation of data in
another (multidimensional) database.
– Once data has been transferred, there may be no simple means
for updating the MOLAP “engine” as individual transactions are
recorded by the RDBMS.
Types of OLAP (iii): HTAP
• Hybrid Transaction / Analytical Processing (HTAP)
– HTAP refers to in-memory data systems that do both online transaction
processing (OLTP) and online analytical processing (OLAP).
– HTAP relies on newer and much more powerful, often
distributed, processing: sometimes it involves a new hardware
“appliance”, and it almost always requires a new software
platform.
– The key point is that all the technology sits in the
relational database. As a result, there is no more data replication,
and new transactional information becomes part of an analytical
model as fast as is technologically possible.
– HTAP ties data together in a way that
has not been possible before: a real uniting of relational data
stored in tables with the data models that are used for decision
making by business leaders.
Types of OLAP(iv):
HOLAP
• HOLAP is the product of the attempt to incorporate the best features of
MOLAP and ROLAP into a single architecture.
• This kind of tool tries to bridge the technology gap of both products by
enabling access to or use of both multidimensional database (MDDB)
and Relational Database Management System (RDBMS) data stores.
• HOLAP systems store larger quantities of detailed data in the
relational tables while the aggregations are stored in the pre-
calculated cubes.
• HOLAP also has the capacity to “drill through” from the cube down to the
relational tables for detailed data.
• Some of the advantages of this system are better scalability, quick data
processing and flexibility in accessing of data sources.
• The issue with HOLAP systems lies precisely in the fact that they
are hybrids: at best they partake of the strengths of other
systems…but they also evince the weaknesses of each, in an
attempted mashup of two distinct technologies.
Types of OLAP (v)
• Desktop OLAP (DOLAP)
– The user can download a section of an OLAP model from another source
and work with that dataset locally, on their desktop.
– It is easier to deploy, with a potentially lower cost, but almost by definition
comes with limited functionality in comparison with other OLAP
applications.
• Web OLAP (WOLAP)
– OLAP is accessed through a Web browser, without any kind of option
for a local install or local client to access data.
– It offers considerably lower investment and enhanced accessibility to
connect to the data.
– By now, most OLAP products provide an option for
Web-only connectivity, while still allowing other client options for
more robust data modeling and other functionality than a Web
client can provide.
• Mobile OLAP
– Provides OLAP functionality on a wireless or mobile device.
– This enables users to access and work on OLAP data and
applications remotely through the use of their mobile devices.
Types of OLAP
(vi)
• Spatial OLAP (SOLAP)
– The aim of Spatial OLAP (thus, SOLAP) is to integrate the
capabilities of both Geographic Information Systems (GIS) and
OLAP into a unified solution, thus facilitating the management
of both spatial and non-spatial data.
– The driving idea is to provide quick exploration of data that can
point to trends and analysis in a geographic context, whether
place-names sourced from a GIS or overlaying maps that show,
for example, customer purchase behaviour.
Multidimensional Data Model (MDDM)
• The dimensional model was developed for implementing data
warehouses and data marts.
• MDDM provides both a mechanism to store data and a way to
analyse the business.
• The two primary components of the dimensional model are
– Dimensions
– Facts
Measures, Dimensions & Facts
• Measures are numerical values that mathematical functions work
on. For example, a sales revenue column is a measure because
you can compute a total or an average of the data. Other examples are
cost, quantity, etc.
• Dimensions are qualitative and cannot be totalled. For example,
sales region, employee, location, or date are dimensions.
• The business events being measured, such as a sale, are known as "facts."
Measure vs. dimension
• The difference between them is that a measure contains numerical
values describing a given fact, whereas a dimension contains descriptive
values (text) of a given fact, stored in the dimension attributes.
– Example: If we want to describe the sale of cars, the number of
cars sold is a measure, while the brand is a dimension
attribute.
• Dimensions are reference information and define the context for the
analysis of measures.
• In a multidimensional model, each measure is related to several
dimensions.
• Most commonly, a dimension has a hierarchical structure that
defines how the values associated with it are aggregated.
Hierarchy
• A hierarchy defines the path along which one can go deeper (roll down)
or go up within a dimension.
• Example

Dimensions   2018         2017
             (₹ 10,000)   (₹ 10,000)
India        1179         1150
China        3200         2865
Nepal        75           60
Bhutan       50           65

Dimensions   2018         2017
             (₹ 10,000)   (₹ 10,000)
NorthZone    300          295
SouthZone    245          265
EastZone     234          345
WestZone     400          250

Dimensions   2018         2017
             (₹ 10,000)   (₹ 10,000)
Delhi
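To make the roll-up along this hierarchy concrete, here is a minimal Python sketch that sums the 2018 zone figures from the example up to the country level. It assumes that the four zones all belong to India, which the tables suggest but do not state; the function name `roll_up` is made up for this illustration.

```python
# 2018 sales per zone (in units of ₹ 10,000), taken from the zone-level table.
zone_sales_2018 = {"NorthZone": 300, "SouthZone": 245, "EastZone": 234, "WestZone": 400}

# Assumed hierarchy: every zone rolls up to the country "India".
zone_to_country = {zone: "India" for zone in zone_sales_2018}

def roll_up(values, parent_of):
    """Aggregate child-level values to their parent level by summing."""
    totals = {}
    for child, value in values.items():
        parent = parent_of[child]
        totals[parent] = totals.get(parent, 0) + value
    return totals

country_sales_2018 = roll_up(zone_sales_2018, zone_to_country)
print(country_sales_2018)  # {'India': 1179}
```

The result matches the 2018 value shown for India in the country-level table; rolling down is the reverse step from the country row to its zone rows.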
Data Warehouse: Schema
• The main task of a data warehouse is not transaction processing (OLTP
systems) but analytical processing (OLAP systems), that is,
helping management make business decisions.
• Given these requirements, the design of the data warehouse
schema must pay particular attention to two aspects:
– the very large amount of data (current + historical)
– achieving a satisfactory level of efficiency for analytical queries.
• To meet these requirements, specialized physical data schemas have
been created for data warehouse needs.
• The basic data warehouse schemas include:
– Star schema
– Snowflake schema
– Star constellation schema
– Starflake schema
Star Schema
• The star schema is the simplest data warehouse model.
• The fact table sits at the centre, surrounded by dimension tables.
• Most often, the fact table holds sales data.
• Typical dimensions are geography, customer, product, time, and business.
Star Schema
• The data in the fact table should be normalized to the third normal
form.
• Dimension tables are usually denormalized.
• The fact table consists of two types of columns:
– columns containing the numerical values (measures) of a given fact
– columns with foreign keys to the dimension tables
• The fact table can contain factual information at a detail or aggregate
level.
Star Schema
• Dimension tables are structures, often composed of one or more
hierarchies, that are used to categorize data.
• In addition to the primary key referenced by the fact table, they
contain fields with attributes describing the given dimension.
• The fact table is much larger than the dimension
tables.
• In the star schema, all hierarchies of a given dimension are
implemented as a single table.
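The layout described above can be sketched with Python's built-in `sqlite3` module. The table and column names (`fact_sales`, `dim_product`, `dim_time`) and all rows are hypothetical; the point is only that the fact table carries foreign keys plus measures, and that one join per dimension answers a typical analytical question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_time    (time_key INTEGER PRIMARY KEY, day TEXT, month INTEGER, year INTEGER);
-- Fact table: foreign keys to the dimensions plus numeric measures.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    time_key    INTEGER REFERENCES dim_time(time_key),
    quantity    INTEGER,
    revenue     REAL
);
INSERT INTO dim_product VALUES (1, 'Laptop', 'Computers'), (2, 'Phone', 'Mobiles');
INSERT INTO dim_time VALUES (1, '2017-03-01', 3, 2017), (2, '2018-03-01', 3, 2018);
INSERT INTO fact_sales VALUES (1, 1, 2, 2000.0), (2, 1, 1, 500.0), (1, 2, 3, 3000.0);
""")

# One join per dimension is enough to answer "revenue by category and year".
star_rows = conn.execute("""
SELECT p.category, t.year, SUM(f.revenue)
FROM fact_sales f
JOIN dim_product p ON p.product_key = f.product_key
JOIN dim_time t    ON t.time_key = f.time_key
GROUP BY p.category, t.year
ORDER BY p.category, t.year
""").fetchall()
print(star_rows)
# [('Computers', 2017, 2000.0), ('Computers', 2018, 3000.0), ('Mobiles', 2017, 500.0)]
```

Note that `dim_product` keeps the whole product hierarchy (name and category) in a single denormalized table, as the star schema prescribes.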
Characteristics
Characteristics of the star schema:
– It has a simple structure, which makes it easy to
understand.
– Queries are highly efficient because of the small number of
table joins.
– Loading data into dimension tables takes relatively long
because of denormalization; the resulting data
redundancy can make the tables large.
– It is the dominant structure for data warehouses,
supported by many tools.
Snowflake Schema
• The snowflake schema is a
more complex version of the
star schema because the tables
that describe the dimensions
are normalized.
• This means that each
dimension can have several
dimension tables of its own.
• This schema is used primarily
when we deal
with very complex dimensions
and want to better reflect the way
users think about
data.
Snowflake Schema: Characteristics
Characteristic features of the snowflake schema:
– Queries are less efficient than in the star
schema because of the greater number of table joins.
– The structure is easier to modify.
– Loading data into tables takes a short time (normalization -> smaller
tables, space saving).
– It is used less often than the star schema, because the efficiency of
queries is more important than the efficiency of loading data
into the dimension tables.

Due to normalization in the snowflake schema,
redundancy is reduced; it therefore becomes easy to
maintain and saves storage space.
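The extra join that normalization introduces can be seen in a small `sqlite3` sketch. All table names and rows here are hypothetical: the category attribute is moved out of the product dimension into its own table, so each category name is stored once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The category attribute is moved out of the product dimension into its own table.
CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    name         TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
);
CREATE TABLE fact_sales (product_key INTEGER REFERENCES dim_product(product_key), revenue REAL);
INSERT INTO dim_category VALUES (1, 'Computers'), (2, 'Mobiles');
INSERT INTO dim_product VALUES (1, 'Laptop', 1), (2, 'Desktop', 1), (3, 'Phone', 2);
INSERT INTO fact_sales VALUES (1, 2000.0), (2, 1000.0), (3, 500.0);
""")

# Reaching the category from the fact table now needs two joins instead of one.
snowflake_rows = conn.execute("""
SELECT c.category, SUM(f.revenue)
FROM fact_sales f
JOIN dim_product p  ON p.product_key = f.product_key
JOIN dim_category c ON c.category_key = p.category_key
GROUP BY c.category
ORDER BY c.category
""").fetchall()
print(snowflake_rows)  # [('Computers', 3000.0), ('Mobiles', 500.0)]
```

The string 'Computers' appears once in `dim_category` rather than once per product, which is the redundancy reduction mentioned above, paid for with the additional join.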
Starflake Schema
• A starflake schema is a hybrid
structure that contains a
mixture of star and snowflake
schemas.
• The most appropriate database
schemas use a mixture of
denormalized star and
normalized snowflake structures.
Constellation Schema
• A galaxy schema contains two or more fact tables that share
dimension tables.
• It is also called a fact constellation schema.
• The schema is viewed as a collection of stars, hence the name
galaxy schema.
Constellation Schema
• Characteristics of the galaxy schema:
– The dimensions in this schema are split into separate
dimensions based on the various levels of hierarchy.
– For example, if geography has four levels of hierarchy such as region,
country, state, and city, then the galaxy schema should have four
dimensions.
– Moreover, it is possible to build this type of schema by splitting
one star schema into several star schemas.
– The dimensions in this schema are large, as they need to be built
based on the levels of hierarchy.
– This schema is helpful for aggregating fact tables for better
understanding.
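A minimal sketch of a fact constellation, again using `sqlite3` with hypothetical tables: two fact tables (`fact_sales` and `fact_shipments`) reference the same `dim_product` dimension, so their measures can be lined up on that shared dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
-- Two fact tables reference the same (shared) dimension table.
CREATE TABLE fact_sales     (product_key INTEGER REFERENCES dim_product(product_key), units_sold INTEGER);
CREATE TABLE fact_shipments (product_key INTEGER REFERENCES dim_product(product_key), units_shipped INTEGER);
INSERT INTO dim_product VALUES (1, 'Laptop'), (2, 'Phone');
INSERT INTO fact_sales VALUES (1, 5), (1, 3), (2, 4);
INSERT INTO fact_shipments VALUES (1, 6), (2, 4), (2, 1);
""")

# Aggregate each fact table separately, then line the results up on the shared dimension.
galaxy_rows = conn.execute("""
SELECT p.name,
       (SELECT SUM(units_sold)    FROM fact_sales s     WHERE s.product_key = p.product_key),
       (SELECT SUM(units_shipped) FROM fact_shipments h WHERE h.product_key = p.product_key)
FROM dim_product p
ORDER BY p.name
""").fetchall()
print(galaxy_rows)  # [('Laptop', 8, 6), ('Phone', 4, 5)]
```

Each fact table is aggregated on its own before the results are combined; joining both fact tables to the dimension in a single step would multiply rows and inflate the sums.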
Cube
• A data cube is a structure that enables OLAP to achieve
multidimensional functionality.
• The data cube is used to represent data along some measure of
interest.
• Data cubes are an easy way to look at data (they allow us to look at
complex data in a simple format).
• Although called a "cube", it can be 2-dimensional, 3-dimensional,
or higher-dimensional.
Cube
• Important concepts associated with data cubes :
– Slicing.
– Dicing.
– Rotating (Pivot)
– Roll-up
– Drill-down.
Slicing
• The term slice most
often refers to a two-dimensional
page selected from the cube.
• A slice is a subset of a
multidimensional array
corresponding to a
single value for one or
more members of the
dimensions not in the
subset.
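Slicing can be illustrated with a tiny cube held in a plain Python dictionary. The cube contents, the dimension order (year, region, product), and the helper `slice_cube` are assumptions made for this sketch, not part of any OLAP product.

```python
# A tiny cube stored as {(year, region, product): sales}.
cube = {
    (2017, "North", "Laptop"): 10, (2017, "North", "Phone"): 4,
    (2017, "South", "Laptop"): 7,  (2017, "South", "Phone"): 6,
    (2018, "North", "Laptop"): 12, (2018, "North", "Phone"): 5,
    (2018, "South", "Laptop"): 8,  (2018, "South", "Phone"): 9,
}

def slice_cube(cube, year):
    """Fix the time dimension to one value; the result has only two dimensions."""
    return {(region, product): v for (y, region, product), v in cube.items() if y == year}

slice_2017 = slice_cube(cube, 2017)
print(slice_2017)
# {('North', 'Laptop'): 10, ('North', 'Phone'): 4, ('South', 'Laptop'): 7, ('South', 'Phone'): 6}
```

The year no longer appears in the keys of the result: it is the dimension "not in the subset", fixed to the single value 2017.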
Dicing
• Dicing is an operation related to
slicing.
• In dicing, we define a
subcube of the original
space.
• Dicing provides the
smallest available
slice.
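A comparable sketch for dicing, under the same assumptions (a hypothetical dictionary-based cube keyed by year, region, and product; `dice_cube` is an illustrative helper): instead of fixing one dimension to a single value, we restrict each dimension to a set of values and keep all three dimensions.

```python
# A tiny cube stored as {(year, region, product): sales}.
cube = {
    (2017, "North", "Laptop"): 10, (2017, "North", "Phone"): 4,
    (2017, "South", "Laptop"): 7,  (2017, "South", "Phone"): 6,
    (2018, "North", "Laptop"): 12, (2018, "North", "Phone"): 5,
    (2018, "South", "Laptop"): 8,  (2018, "South", "Phone"): 9,
}

def dice_cube(cube, years, regions, products):
    """Keep only the cells whose coordinates fall inside the chosen value sets."""
    return {
        key: value
        for key, value in cube.items()
        if key[0] in years and key[1] in regions and key[2] in products
    }

subcube = dice_cube(cube, years={2017, 2018}, regions={"North"}, products={"Laptop"})
print(subcube)  # {(2017, 'North', 'Laptop'): 10, (2018, 'North', 'Laptop'): 12}
```

The result is still a cube with three-part keys, only smaller.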
Rotating or Pivoting
• Rotating changes the
dimensional orientation of the
report from the cube data.
• For example …
– rotating may consist of
swapping the rows and
columns, or moving one of
the row dimensions into the
column dimension
– or swapping an off-
spreadsheet dimension with
one of the dimensions in
the page display
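The simplest case above, swapping rows and columns, can be sketched as a transpose of a two-dimensional report. The report values, labels, and the helper `pivot` are made up for illustration.

```python
# A two-dimensional report: rows are regions, columns are years.
row_labels = ["North", "South"]
col_labels = [2017, 2018]
report = [
    [14, 17],  # North
    [13, 17],  # South
]

def pivot(rows, cols, table):
    """Swap the row and column dimensions (a transpose of the report)."""
    transposed = [list(column) for column in zip(*table)]
    return cols, rows, transposed

new_rows, new_cols, pivoted = pivot(row_labels, col_labels, report)
print(new_rows, new_cols, pivoted)  # [2017, 2018] ['North', 'South'] [[14, 13], [17, 17]]
```

The cell values are unchanged; only the orientation of the report differs.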
Roll up and Roll down
[Figure: Roll Down and Roll Up]
Problem
• Suppose that a data warehouse consists of the three dimensions
time, doctor, and patient, and the two measures count and
charge, where charge is the fee that a doctor charges a patient
for a visit.
(a) Enumerate three classes of schemas that are popularly used
for modeling data warehouses.
(b) Draw a schema diagram for the above data warehouse
using one of the schema classes listed in (a).
(c) Starting with the base cuboid [day, doctor, patient], what
specific OLAP operations should be performed in
order to list the total fee collected by each doctor in
2017?
(d) To obtain the same list, write an SQL query
assuming the data is stored in a relational database
with the schema fee (day, month, year, doctor, hospital,
patient, count, charge).
Solution

Star Schema
Starting with the base cuboid [day, doctor, patient], what specific
OLAP operations should be performed in order to list the total fee
collected by each doctor in 2017?

The operations to be performed are:
• Roll-up on time from day to year.
• Slice for time = 2017.
• Roll-up on patient from individual patient to all.
To obtain the same list, write an SQL query assuming the data is
stored in a relational database with the schema

fee(day, month, year, doctor, hospital, patient, count, charge).

SELECT doctor, SUM(charge)
FROM fee
WHERE year = 2017
GROUP BY doctor
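To check the query, it can be run against a small in-memory table with Python's `sqlite3` module. The rows below are invented sample data, and an `ORDER BY doctor` clause is added only to make the printed order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE fee (
    day INTEGER, month INTEGER, year INTEGER,
    doctor TEXT, hospital TEXT, patient TEXT,
    "count" INTEGER, charge REAL
)
""")
conn.executemany(
    "INSERT INTO fee VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 1, 2017, "Dr. A", "H1", "P1", 1, 100.0),
        (2, 1, 2017, "Dr. A", "H1", "P2", 1, 150.0),
        (3, 2, 2017, "Dr. B", "H2", "P1", 1, 200.0),
        (4, 2, 2018, "Dr. B", "H2", "P3", 1, 300.0),  # excluded: not in 2017
    ],
)

# WHERE plays the role of the slice on year; GROUP BY doctor rolls up over day and patient.
fee_rows = conn.execute(
    "SELECT doctor, SUM(charge) FROM fee WHERE year = 2017 "
    "GROUP BY doctor ORDER BY doctor"
).fetchall()
print(fee_rows)  # [('Dr. A', 250.0), ('Dr. B', 200.0)]
```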
Types of Facts
• Additive facts: Additive facts can be used with any
aggregation function, such as Sum(), Avg(), etc.
• Semi-additive facts: Only some
aggregation functions can be applied to them. For example, consider
bank account balances. Applying Sum() to a
balance over time does not give a useful result, but the Min() and
Max() functions may return useful information.
• Non-additive facts: You cannot use numeric aggregation
functions such as Sum(), Avg(), etc. on non-additive facts.
An example of a non-additive fact is any kind of ratio or
percentage. Non-numeric facts can also be non-additive
facts.
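The bank balance case can be worked through in a few lines of Python. The account names and balances are invented; the sketch only shows that a balance adds up sensibly across accounts for one point in time, while adding it across months produces a number that was never actually held.

```python
# Month-end balances per account (illustrative values).
balances = {
    ("acct-1", "Jan"): 100, ("acct-1", "Feb"): 120,
    ("acct-2", "Jan"): 50,  ("acct-2", "Feb"): 40,
}

# Adding across accounts for one month is meaningful: total holdings at the end of Feb.
total_feb = sum(v for (acct, month), v in balances.items() if month == "Feb")

# Adding one account's balance across months is not meaningful;
# min, max, or the latest value are the useful aggregates along time.
acct1_over_time = [v for (acct, month), v in balances.items() if acct == "acct-1"]
misleading_sum = sum(acct1_over_time)
peak_balance = max(acct1_over_time)

print(total_feb, misleading_sum, peak_balance)  # 160 220 120
```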
Fact-less Facts
• A fact table that does not contain any measure is a
fact-less fact table. Such a table contains only
keys from different dimension tables. It is often
used to resolve a many-to-many cardinality issue.
• For example, a fact table which has only a productID
and a date key is a fact-less fact table.
Types of Fact-less fact tables
• Event-capturing table
• This type of fact table establishes the relationship among the
various dimension members from various dimension tables
without any measured value.
• For example, a student attendance table (a student-teacher relation
table) is an event-capturing fact-less fact table. An entry is added
to the table whenever a student attends a class.
• The following questions can be answered by the student
attendance table:
• Which student is taught by the maximum number of teachers?
• Which class has the highest attendance?
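Both questions can be answered by counting rows, even though the table holds no measure. The sketch below uses a hypothetical attendance table with made-up keys (student, teacher, class, date).

```python
from collections import Counter

# Each row records only keys: (student_id, teacher_id, class_id, date). No measure column.
attendance = [
    ("s1", "t1", "math", "2018-01-10"),
    ("s1", "t2", "physics", "2018-01-10"),
    ("s2", "t1", "math", "2018-01-10"),
    ("s3", "t1", "math", "2018-01-11"),
]

# Which class has the highest attendance? Count rows per class.
attendance_per_class = Counter(class_id for _, _, class_id, _ in attendance)
busiest_class = attendance_per_class.most_common(1)[0][0]

# Which student is taught by the most teachers? Count distinct teachers per student.
teachers_per_student = {}
for student, teacher, _, _ in attendance:
    teachers_per_student.setdefault(student, set()).add(teacher)
top_student = max(teachers_per_student, key=lambda s: len(teachers_per_student[s]))

print(busiest_class, top_student)  # math s1
```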
Types of Fact-less fact tables
• Coverage table – Describing condition
• This is another kind of fact-less fact table. An event-capturing
fact-less fact table can only answer positive queries; it cannot answer
a negative query. A coverage fact table is used to support negative
analysis reports, for example, to find the products that an
electronics store did not sell in a given period of time.
• If you consider the student-teacher relation table, the event-capturing fact
table cannot answer ‘which teacher did not teach any student?’ The coverage
fact table answers this question by adding an extra flag: 0 for the negative
condition and 1 for the positive condition.
• If the student table has 20 records and the teacher table has 3 records, then the
coverage fact table will store 20 * 3 = 60 records, one for each possible
combination. If a teacher is not teaching a particular student, then that
record will have flag 0 in it.
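The 20 * 3 = 60 example can be reproduced with `itertools.product`. The student and teacher identifiers and the set of pairs that actually occurred are hypothetical; in this sample data, teacher `t3` taught nobody.

```python
from itertools import product

students = [f"s{i}" for i in range(1, 21)]  # 20 students
teachers = ["t1", "t2", "t3"]                # 3 teachers

# Hypothetical event data: (student, teacher) pairs that actually occurred.
taught = {("s1", "t1"), ("s2", "t1"), ("s3", "t2")}

# Coverage table: one row per possible combination, flag 1 if it occurred, else 0.
coverage = [(s, t, 1 if (s, t) in taught else 0) for s, t in product(students, teachers)]
print(len(coverage))  # 60

# Negative query: which teacher did not teach any student?
idle_teachers = [
    t for t in teachers
    if all(flag == 0 for _, teacher, flag in coverage if teacher == t)
]
print(idle_teachers)  # ['t3']
```

The event-capturing table alone (`taught`) has no row for `t3`, so the negative answer has to come from the full set of combinations.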
Types of Dimensions

• Conformed dimensions
• Junk dimensions
• Degenerate dimensions
• Role playing dimensions
Conformed dimensions
• A conformed dimension is a dimension that is
shared across multiple data marts or subject areas.
A company may use the same dimension table across
different projects without making any changes to
the dimension table.
• An example of a conformed dimension is the Customer
dimension: both the marketing and sales
departments can use the Customer dimension for their
reporting purposes.
Junk dimension
• A junk dimension is a grouping of typically low-cardinality
attributes, so that you can remove them from the
main dimension.
• You can use junk dimensions to implement a
rapidly changing dimension, using them to
store the attributes that change rapidly.
• Examples are attributes such as flags, weights, BMI
(body mass index), etc.
Degenerate Dimension
• A degenerate dimension is a dimension that is
derived from the fact table and does not have its own
dimension table.
• For example, a receipt number does not have a
dimension table associated with it. Such details are
just for information purposes.
Role-playing dimension
• Dimensions that are used for multiple
purposes within the same database are called role-playing
dimensions.
• For example, you can use a date dimension for
“date of sale”, as well as “date of delivery”, or
“date of hire”.
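In SQL terms, a role-playing dimension usually shows up as the same dimension table joined more than once under different aliases. The `sqlite3` sketch below uses hypothetical tables `dim_date` and `fact_order` with two roles, sale date and delivery date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, calendar_date TEXT);
CREATE TABLE fact_order (
    order_id          INTEGER,
    sale_date_key     INTEGER REFERENCES dim_date(date_key),
    delivery_date_key INTEGER REFERENCES dim_date(date_key)
);
INSERT INTO dim_date VALUES (1, '2018-01-10'), (2, '2018-01-15');
INSERT INTO fact_order VALUES (100, 1, 2);
""")

# The same dimension table is joined twice, once per role, under different aliases.
role_rows = conn.execute("""
SELECT o.order_id, sale.calendar_date, delivery.calendar_date
FROM fact_order o
JOIN dim_date sale     ON sale.date_key = o.sale_date_key
JOIN dim_date delivery ON delivery.date_key = o.delivery_date_key
""").fetchall()
print(role_rows)  # [(100, '2018-01-10', '2018-01-15')]
```

Only one physical date table is maintained; each foreign key in the fact table gives it a different role.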
