
Data Warehouse and Data Mining
Prof. Dr. M. S. Memon
sulleman@quest.edu.pk
03337037187
May 20, 2023
Architecture of DW
• Basic Architecture
Data Warehouse Architecture
Operational Source systems
• These are the operational systems of record
that capture the transactions of the business.
• These systems are outside the data warehouse,
which has little or no control over the content and
format of their data
• The source systems maintain little historical data
• These systems generate operation data that is
detailed, current and subject to change
Data Staging Area
• Data staging area can be divided into three
phases
– Extraction (E)
– Transformation (T)
– Loading (L)
• Extraction: It means reading and
understanding the source data and copying
the data needed for the data warehouse into
staging area for further manipulation (i.e.
transformation)
Data Staging Area
• Loading: Loading refers to populating of data
warehouse with data that has been extracted from
operational systems.
• There are two types of loads, which generally
take place in data warehouse environment:
– Initial load
– Incremental load
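The two load types can be sketched as follows. This is a minimal illustration, assuming each source row carries a last-modified date; the row layout, the `initial_load` and `incremental_load` helpers, and the sample values are hypothetical. Real warehouses often keep history instead of overwriting rows.

```python
from datetime import date

# Hypothetical operational rows: (order_id, last_modified, amount).
source_rows = [
    (1, date(2023, 5, 1), 120.0),
    (2, date(2023, 5, 10), 75.5),
    (3, date(2023, 5, 19), 310.0),
]

def initial_load(source):
    """Populate an empty warehouse table with every source row."""
    return {row[0]: row for row in source}

def incremental_load(warehouse, source, since):
    """Apply only the rows modified after the previous load date."""
    changed = [row for row in source if row[1] > since]
    for row in changed:
        warehouse[row[0]] = row  # insert new rows, overwrite changed ones
    return len(changed)

warehouse = initial_load(source_rows)

# Later, the operational system changes one order and adds another.
source_rows[1] = (2, date(2023, 5, 20), 80.0)
source_rows.append((4, date(2023, 5, 20), 42.0))

applied = incremental_load(warehouse, source_rows, since=date(2023, 5, 19))
print(applied, sorted(warehouse))
```

The initial load touches every row once; each later run moves only what changed since the previous load date.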
Data Staging Area
• Transformation: The transformation phase applies
a series of rules or functions to the extracted or
loaded data.
• This may include some or all of the following:
– Select only certain columns to load (or, if you prefer, select the
null columns not to load)
– Translate coded values
– Derive a new calculated value (e.g. sale_amount = qty * unit_price)
– Denormalize the data to fit the data warehouse schema
– Summarize multiple rows of data (e.g. total sales for each region)
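The transformation rules listed above might look like this in code. It is only a sketch: the field names, the region codes and the `transform` helper are assumptions made for illustration, and only the `sale_amount = qty * unit_price` rule comes from the slide.

```python
from collections import defaultdict

# Hypothetical extracted rows; field names and codes are illustrative only.
extracted = [
    {"region": "N", "qty": 2, "unit_price": 10.0, "clerk": "a17"},
    {"region": "S", "qty": 1, "unit_price": 99.5, "clerk": "b02"},
    {"region": "N", "qty": 5, "unit_price": 4.0, "clerk": "a17"},
]
REGION_NAMES = {"N": "North", "S": "South"}  # lookup table for coded values

def transform(row):
    """Keep only the needed columns, translate codes, derive sale_amount."""
    return {
        "region": REGION_NAMES[row["region"]],          # translate coded value
        "sale_amount": row["qty"] * row["unit_price"],  # derive calculated value
        # the "clerk" column is deliberately not carried over
    }

transformed = [transform(r) for r in extracted]

# Summarize multiple rows: total sales for each region.
totals = defaultdict(float)
for row in transformed:
    totals[row["region"]] += row["sale_amount"]

print(dict(totals))
```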
Data Staging Area
• The Data Staging Area
– Is both a storage and process area (the ETL process)
– It represents everything that happens between the
operational source system and the data presentation
area
– The key architectural requirement for data staging area
is that it is off-limits to business users and does not
provide query and presentation services
– It should be accessible only to skilled professionals
ETL versus ELT
• ETL (The traditional approach): ETL (Extract, transform,
and load) is a process in data warehousing that involves:
– Extracting data from outside sources
– transforming it to fit business needs, and ultimately
– loading it into the data warehouse
• ELT (The Teradata Approach): ELT (Extract, Load and
Transform) strategy extracts and loads the data into a
Teradata Database first, then uses the power and
performance of the Teradata Warehouse to perform the
transformation
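The difference between the two strategies is only where the transformation runs. The sketch below uses Python's built-in `sqlite3` module as a stand-in for the warehouse database (it is not Teradata); the table names, sample rows and the `etl` and `elt` helpers are hypothetical.

```python
import sqlite3

# Hypothetical source rows: (region code, qty, unit_price).
source = [("n", 2, 10.0), ("s", 1, 99.5), ("n", 5, 4.0)]

def etl(rows):
    """Transform outside the database, then load the finished rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (region TEXT, sale_amount REAL)")
    shaped = [(region.upper(), qty * price) for region, qty, price in rows]
    db.executemany("INSERT INTO sales VALUES (?, ?)", shaped)
    return db

def elt(rows):
    """Load the raw rows first, then transform with the database's own SQL."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE raw_sales (region TEXT, qty INTEGER, unit_price REAL)")
    db.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", rows)
    db.execute(
        "CREATE TABLE sales AS "
        "SELECT upper(region) AS region, qty * unit_price AS sale_amount "
        "FROM raw_sales"
    )
    return db

query = "SELECT region, sum(sale_amount) FROM sales GROUP BY region ORDER BY region"
etl_result = etl(source).execute(query).fetchall()
elt_result = elt(source).execute(query).fetchall()
print(etl_result == elt_result)
```

Both paths end with the same `sales` table; ELT simply moves the work onto the warehouse engine.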
Data Presentation Area
• Extended Relational DBMS
(ROLAP servers)
– data stored in RDB
– star-join schemas
– support SQL extensions (Cube)
– Index structures (bitmap, join)
• Multidimensional DBMS
(MOLAP servers)
– data stored in arrays (n-dimensional
array)
– direct access to array data structure
– poor storage utilization, especially
when the data is sparse
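To see why array storage wastes space on sparse data, consider a small n-dimensional array in which only a few cells hold values. The dimension sizes and sales figures below are made up for illustration.

```python
# Dimension sizes for a hypothetical model x color x dealer cube.
DIMS = (10, 10, 10)

# Only a few combinations actually have sales (made-up figures).
facts = {(0, 1, 2): 6, (3, 3, 3): 4, (9, 0, 5): 1}

# Dense n-dimensional array: one cell for every possible combination.
dense = [[[0] * DIMS[2] for _ in range(DIMS[1])] for _ in range(DIMS[0])]
for (m, c, d), qty in facts.items():
    dense[m][c][d] = qty  # direct access by position, no search needed

total_cells = DIMS[0] * DIMS[1] * DIMS[2]
utilization = len(facts) / total_cells
print(f"{len(facts)} of {total_cells} cells used ({utilization:.1%})")
```

Every cell is allocated whether or not a sale exists, which is the storage cost paid for direct positional access.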
Data Presentation Area
• The Data Presentation Area
– Is where data is organized, stored and made available
for queries, report writers, and other analytical
processing
– This area is the Warehouse as far as the business
community is concerned
Data Access Tools
• Analysis / OLAP / DSS Tools

• Querying / Reporting Tools

• Data Mining
Warehouse components
Component: Operational Data
• The data for the data warehouse is supplied
from:
– Mainframe systems that hold data in the traditional
network and hierarchical formats
– Relational DBMSs such as Oracle and Informix
– In addition to this internal data, operational data also
includes external data obtained from commercial
databases and from databases associated with suppliers
and customers
Component: Load Manager
• The load manager (also called the front end
component) performs all the operations associated
with extraction and loading data into the data
warehouse
• These operations include simple transformations of
the data to prepare the data for entry into the
warehouse
• The size and complexity of this component will vary
between data warehouses and may be constructed
using a combination of vendor data loading tools and
custom built programs
Component: Warehouse Manager
• The warehouse manager performs all the operations
associated with the management of data in the warehouse
• This component is built using vendor data management
tools and custom built programs
• The operations performed by the warehouse manager include:
– Analysis of data to ensure consistency
– Transforming and merging the source data from temporary
storage into data warehouse tables
– Creating indexes and views on the base tables
– Generating denormalizations
– Generating aggregations
– Backing up and archiving data
Warehouse Manager: Detailed Data

• This area of the warehouse stores all the
detailed data in the database schema
• In most cases detailed data is not stored online
but aggregated to the next level of detail
• However, the detailed data is added regularly
to the warehouse to supplement the
aggregated data
Warehouse Manager: Lightly and Highly Summarized Data
• This area of the data warehouse stores all the
predefined lightly and highly summarized
(aggregated) data generated by the warehouse
manager
• This area of the warehouse is transient as it will
be subject to change on an ongoing basis in
order to respond to the changing query profiles
• The purpose of the summarized information is to
speed up the query performance
• The summarized data is updated continuously as
new data is loaded into the warehouse
Warehouse Manager: Archive and Back-up Data
• This area of the warehouse stores detailed and
summarized data for the purpose of archiving and
back-up
• The data is transferred to storage archives such
as magnetic tapes or optical disks
Warehouse Manager: Meta Data
• The data warehouse also stores all the Meta data (data
about data) definitions used by all processes in the
warehouse
• It is used for a variety of purposes, including:
– The extraction and loading process – Meta data is used to map data
sources to a common view of information within the warehouse.
– The warehouse management process – Meta data is used to
automate the production of summary tables.
– The query management process – Meta data is used to direct a
query to the most appropriate data source.
• The structure of Meta data will differ in each process,
because the purpose is different
Component: Query Manager
• The query manager (also called the back end
component) performs all operations associated
with management of user queries
• This component is usually constructed using
vendor end-user access tools, data warehousing
monitoring tools, database facilities and custom
built programs
• The complexity of a query manager is determined
by facilities provided by the end-user access
tools and database
Component: End-user Access Tools
• The principal purpose of a data warehouse is to
provide information to the business managers
for strategic decision-making
• These users interact with the warehouse using
end user access tools
• Examples of end-user access tools include:
– Reporting and Query Tools
– Application Development Tools
– Executive Information Systems Tools
– Online Analytical Processing Tools
– Data Mining Tools
Data Marts & Storage Structure
M. S. Memon, CSE Dept., QUEST Nawabshah
May 20, 2023
Data Mart
• A data mart is a special purpose subset of
enterprise data for a particular function or
application (It may contain detail or summary data
or both).
• Data Mart types:
– Independent—created directly from operational systems
to a separate physical data store
– Logical—exists as a subset of existing data warehouse.
– Dependent—created from data warehouse to a separate
physical data store
Data Marts
[Figure: Operational Systems, Independent Data Mart, Data Warehouse, Dependent Data Mart, Logical Data Mart]
A dimensional model for a large data
warehouse consists of between 10 and 25
similar-looking data marts. Each data mart
will have 5 to 15 dimension tables
Dependent vs. Independent Data Marts
Data warehouse and Data mart
• Bill Inmon in 1998: “The single most important
issue facing the IT manager this year is
whether to build the data warehouse or the data
mart first.”
– Top-down approach, bottom-up approach, or a practical approach?
• Fundamental issues:
– Enterprise-wide or departmental?
– Which first – data warehouse or data mart?
– Dependent or independent data marts?
– Build a pilot or go with a full-fledged implementation?
Top-down Approach
• The breaking down of a system to gain insight into
its compositional sub-systems.
• The advantages of this approach are:
– An enterprise view of data
– Inherently architected, not a union of disparate data marts
– Single, central storage of data about the content
– Centralized rules and control
• The disadvantages are:
– Takes longer to build
– High risk of failure
Bottom-up Approach
• The piecing together of systems (elements) to
give rise to more complex systems.
• The advantages of this approach are:
– Faster and easier implementation of manageable pieces
– Favorable ROI (return on investment)
– Less risk of failure
– Inherently incremental
• The disadvantages are:
– Redundant data is spread across the data marts
– Data can become inconsistent between data marts
Reasons behind Data marts
• Provide access to the data that is analyzed most often
• Improve end-user response time due to reduction
in volume of data to be accessed
• Simpler to build compared with establishing a
corporate data warehouse
• Cost of implementation is normally less than that
required to establish a data warehouse
Storage Structure
• Storage structure
– After extraction from the operational systems, the
information in a DW is stored in databases
– The databases are operated by a DBMS
– Different database structures can be used for a DW:
• Relational model (RDB) operated by an RDBMS
• Multidimensional model (MDB) operated by an MDBMS
Storage Structure
• RDB and MDB are complementary and do not
have to exclude each other
– An RDBMS can be used in the staging area, but it must
be off-limits to user queries for performance reasons
– By default, normalized databases are excluded from the
presentation area, which should be strictly
multidimensional (MDBMS)
Relational DB
• DB in relational model
– A database is seen as a collection of predicates over a
finite set of variables
– The content of the DB is modeled as a set of relations
in which all predicates are satisfied
Relational DB
• A relation is defined as a set of tuples that have
the same attributes
– It is usually described as a table
Multidimensional DB
• Multidimensional DB (MDB) are optimized for
DW and OLAP applications
– They are created using input from the staging area
– Designed for efficient and convenient storage and
retrieval of large volumes of data
– Stored, viewed and analyzed from different perspectives
called dimensions
Multidimensional DB
• Example: an automobile manufacturer wants to increase sales
volumes
– The evaluation requires viewing historical sales volume figures from
multiple dimensions
– Sales volume by model, by color, by dealer, over time
– A relational structure for the given evaluation would be:
Multidimensional DB
• The complexity grows quickly with the number of
dimensions and the number of positions
– Example: 3 dimensions with 10 values each and no
indexes
– If the information is viewed in an RDB, the worst case
is a view of 10^3 = 1000 records
Multidimensional DB
• Now, if performance is considered
– For responding to a query when car type = Sedan,
color = Blue, and dealer = Berg
• RDBMS has to search through 1000 records to find the right
record
• MDB has more knowledge about where data lies
• With an MDB, at most 30 positions have to be searched
(10 along each of the 3 dimensions)
• Average case: 18 positions (MDB) vs. 501 (RDBMS)
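The worst-case figures above (1000 records vs. 30 positions) can be reproduced with a small counting sketch. It assumes an unindexed list of records for the relational side and three position lists for the multidimensional side; the placeholder names such as `type0` and the helper functions are hypothetical.

```python
from itertools import product

# Ten positions per dimension; the queried values sit last on each axis.
TYPES = [f"type{i}" for i in range(9)] + ["Sedan"]
COLORS = [f"color{i}" for i in range(9)] + ["Blue"]
DEALERS = [f"dealer{i}" for i in range(9)] + ["Berg"]

# Relational view: one record per (type, color, dealer) combination, no index.
records = [(t, c, d, 0) for t, c, d in product(TYPES, COLORS, DEALERS)]

def scan_records(target):
    """Return how many records are inspected before the target is found."""
    for inspected, rec in enumerate(records, start=1):
        if rec[:3] == target:
            return inspected
    return len(records)

def locate_positions(target):
    """Return how many positions are inspected, one dimension at a time."""
    inspected = 0
    for axis, value in zip((TYPES, COLORS, DEALERS), target):
        inspected += axis.index(value) + 1
    return inspected

worst = ("Sedan", "Blue", "Berg")  # last position on every dimension
print(scan_records(worst), locate_positions(worst))
```

The scan grows with the product of the dimension sizes, while the positional lookup grows with their sum.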
Multidimensional DB
• If the query is more relaxed
– Total sales across all dealers for all colors when
car type = sedan
• RDBMS still has to go through the 1000 records
• MDB, however, goes only through a slice of 10x10
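The relaxed query can be sketched the same way. The cube below is a nested Python list filled with the made-up value 1 in every cell, and `SEDAN` is a hypothetical position on the car-type dimension.

```python
# cube[t][c][d] holds the sales volume for type t, color c, dealer d.
SIZE = 10
cube = [[[1] * SIZE for _ in range(SIZE)] for _ in range(SIZE)]
SEDAN = 3  # hypothetical position of "sedan" on the car-type dimension

# Relational style: visit every one of the 1000 records and filter.
rows = [(t, c, d, cube[t][c][d])
        for t in range(SIZE) for c in range(SIZE) for d in range(SIZE)]
rows_visited = len(rows)
total_by_scan = sum(v for t, _, _, v in rows if t == SEDAN)

# Multidimensional style: jump to the sedan slice and sum its 10 x 10 cells.
sedan_slice = cube[SEDAN]
cells_visited = sum(len(color_row) for color_row in sedan_slice)
total_by_slice = sum(sum(color_row) for color_row in sedan_slice)

print(rows_visited, cells_visited, total_by_scan == total_by_slice)
```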
Multidimensional DB
• Performance advantages
– MDBs can be an order of magnitude faster than RDBMSs
– Performance benefits are more for queries that generate
cross-tab views of data (the case of DW)
• Conclusion
– The performance advantages offered by MDBs
facilitate the development of interactive decision
support applications like OLAP that can be impractical in
a relational environment
Data Dimensionality: Cube
Cube: A group of data cells arranged by the dimensions of the data.
[Figure: sales cube with the dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), product (TV, PC, VCR, sum) and country (Pakistan, China, India, sum). Labelled cells: 1st Qtr sales of TV in Pakistan; total annual sales of TV in Pakistan; total annual sales of TV, PC & VCR in India]
Data Dimensionality
Possible views of Sale:
• How many Products were sold at a Time to specific Customer(s)?
• How many Customers bought the Product(s) at a specific Time?
• At which Time(s) did the Customer(s) buy the specific Product(s)?
[Figure: Sale cube with the axes Customers, Products and Time]
Multi-dimensional Data
• Measures - numerical data being tracked
• Dimensions - business parameters that define a
transaction
• Example: Analyst may want to view sales data
(measure) by geography, by time, and by product
(dimensions)
• Dimensional modeling is a technique for
structuring data around the business concepts
• ER models describe “entities” and “relationships”
• Dimensional models describe “measures” and
“dimensions”
Multi-dimensional Model
“Sales by product line over the past six months”
“Sales by store between 1990 and 1995”
[Figure: star schema. The fact table for measures holds the key columns Prod Code, Time Code and Store Code, which join it to the dimension tables (Product Info, Time Info, Store Info, ...), together with numerical measures such as Sales Qty]
Multi-dimensional Model
• Every dimensional model (DM) is composed of
one table with a composite primary key, called the
fact table, and a set of smaller tables called
dimension tables
• Forms ‘star-like’ structure, which is called a star
schema or star join
• Dimensions are organized into hierarchies
– E.g., Time dimension: days → weeks → quarters
– E.g., Product dimension: product → product line → brand
• Dimensions have attributes
– e.g., owner, city and county of a store
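A star schema of this shape can be sketched with Python's built-in `sqlite3` module. The table names, the attribute columns and all sample rows are assumptions made for illustration; only the key columns (product, time and store codes) and the sales quantity measure follow the slides.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- Dimension tables (hypothetical attributes).
CREATE TABLE product_info (prod_code INTEGER PRIMARY KEY, product TEXT, product_line TEXT);
CREATE TABLE time_info    (time_code INTEGER PRIMARY KEY, day TEXT, quarter TEXT);
CREATE TABLE store_info   (store_code INTEGER PRIMARY KEY, city TEXT, county TEXT);

-- Fact table: composite primary key made of the dimension keys, plus a measure.
CREATE TABLE sales_fact (
    prod_code  INTEGER REFERENCES product_info,
    time_code  INTEGER REFERENCES time_info,
    store_code INTEGER REFERENCES store_info,
    sales_qty  INTEGER,
    PRIMARY KEY (prod_code, time_code, store_code)
);

INSERT INTO product_info VALUES (1, 'TV-A', 'TV'), (2, 'TV-B', 'TV'), (3, 'PC-A', 'PC');
INSERT INTO time_info    VALUES (1, '2023-01-15', 'Q1'), (2, '2023-04-02', 'Q2');
INSERT INTO store_info   VALUES (1, 'City A', 'County A');
INSERT INTO sales_fact   VALUES (1, 1, 1, 5), (2, 1, 1, 3), (3, 1, 1, 2), (1, 2, 1, 4);
""")

# "Sales by product line": join the fact table to one dimension and aggregate.
by_line = db.execute("""
    SELECT p.product_line, sum(f.sales_qty)
    FROM sales_fact f JOIN product_info p ON p.prod_code = f.prod_code
    GROUP BY p.product_line
    ORDER BY p.product_line
""").fetchall()
print(by_line)
```

Rolling up from product to product line is a walk up the product hierarchy, done here with a single join and a GROUP BY.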
RDB vs. MDB
• Any database manipulation is possible with both
technologies
• MDBs however offer some advantages in the
context of DW:
– Ease of data presentation
– Ease of maintenance
– Performance
RDB vs. MDB
• Ease of data presentation
– Data views are natural output of the MDBs
– Obtaining the same views in RDB requires a complex
query
• Example with Walmart and Sybase:
select sum(sales.quantity_sold)
from sales, products, product_categories, manufacturers, stores, cities
where manufacturer_name = 'Colgate'
  and product_category_name = 'toothpaste'
  and cities.population < 40000
  and trunc(sales.date_time_of_sale) = trunc(sysdate-1)
  and sales.product_id = products.product_id
  and sales.store_id = stores.store_id
  and products.product_category_id = product_categories.product_category_id
  and products.manufacturer_id = manufacturers.manufacturer_id
  and stores.city_id = cities.city_id
RDB vs. MDB
• Ease of data presentation
– Top k queries cannot be expressed well in SQL
• Find the five cheapest hotels in Frankfurt
– SELECT * FROM hotels h WHERE h.city = 'Frankfurt' AND 5 >
(SELECT count(*) FROM hotels h1 WHERE h1.city = 'Frankfurt'
AND h1.price < h.price);
• Some RDBMSs extended the functionality of SQL with STOP
AFTER functionality
– SELECT * FROM hotels WHERE city = 'Frankfurt' ORDER BY
price STOP AFTER 5;
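The two formulations can be compared on a toy table with Python's built-in `sqlite3` module. SQLite has no STOP AFTER clause, so the sketch substitutes its `LIMIT` clause; the hotel names and prices are made up.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE hotels (name TEXT, city TEXT, price REAL)")
db.executemany(
    "INSERT INTO hotels VALUES (?, ?, ?)",
    [(f"hotel{i}", "Frankfurt", 50.0 + 10 * i) for i in range(8)]
    + [("cheap_elsewhere", "Berlin", 20.0)],
)

# Correlated subquery: keep a hotel if fewer than 5 hotels are cheaper.
nested = db.execute("""
    SELECT name FROM hotels h
    WHERE h.city = 'Frankfurt'
      AND 5 > (SELECT count(*) FROM hotels h1
               WHERE h1.city = 'Frankfurt' AND h1.price < h.price)
    ORDER BY h.price
""").fetchall()

# Sort-and-cut formulation, using LIMIT in place of STOP AFTER.
limited = db.execute("""
    SELECT name FROM hotels
    WHERE city = 'Frankfurt'
    ORDER BY price
    LIMIT 5
""").fetchall()

print(nested == limited)
```

With distinct prices both queries return the same five hotels. When prices tie, the subquery form can return more than five rows, while the cut-off form always stops at five.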
RDB vs. MDB
• Ease of maintenance
– No additional overhead to translate user queries into
requests for data
• Data is stored as it is viewed
– RDBs use indexes and sophisticated joins
which require significant maintenance and
storage to provide same intuitiveness
RDB vs. MDB
• Performance
– Performance of MDBs can be matched by RDBs through
database tuning
– Not possible to tune the database for all possible ad-
hoc queries
– Aggregate navigators are helping RDBs to catch up
with MDBs as far as aggregation queries are concerned
Multidimensional DB
• When are MDBs inappropriate?
– If the dataset types are not highly related, using an MDB
results in a sparse representation
Multidimensional DB
• When are MDBs appropriate?
– In the case of highly interrelated dataset types MDBs
are recommended for greatest ease of access and
analysis
– Examples of applications
• Financial Analysis and Reporting
• Budgeting
• Promotion Tracking
• Quality Assurance and Quality Control
• Product Profitability
DW Architectures
• Popular DW architectures
– Generic Two-Tier Architecture
– Independent Data Mart
– Dependent Data Mart and Operational Data Store
– Logical Data Mart and Active Warehouse
– Three-Tier Architecture
• Other
– One-Tier Architecture
– N-Tier Architecture
– Web-based Architecture
Layered Architectures
• Generic Two-Tier Architecture
– Data is not completely current in the DW
– Periodic extraction
Layered Architectures
• Data analysis comes in two flavors
– Depending on the execution place of the analysis
• Thin Client
– Analytics are executed on the server
– Client just displays
– This architecture fits well for Internet/Intranet DW access
Layered Architectures
• Fat Client
– The server just delivers the data
– Analytics are executed on the client
– Communication between client and server must be able to sustain
large data transfers
Layered Architectures
• Independent Data Mart
– Mini warehouses – limited in scope
– Separate ETL for each independent Data Mart
– Accessing data across the data marts is complex
Layered Architectures
• Dependent Data Mart and Operational Data
Store
– Single ETL for the DW
– Data Marts are loaded from the DW
– Simpler data access than in the previous case
Layered Architectures
• Logical Data Mart and Active Warehouse
– The ETL is near real-time
– Data Marts are not separate databases, but logical views of the
DW
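A logical data mart can be pictured as a database view over the warehouse tables. The sketch below uses Python's built-in `sqlite3` module; the `sales` table, the `marketing_mart` view and the sample rows are hypothetical.

```python
import sqlite3

dw = sqlite3.connect(":memory:")
dw.executescript("""
CREATE TABLE sales (dept TEXT, region TEXT, amount REAL);
INSERT INTO sales VALUES
    ('marketing', 'North', 100.0),
    ('finance',   'North', 250.0),
    ('marketing', 'South',  40.0);

-- The logical data mart: a view, not a separate physical data store.
CREATE VIEW marketing_mart AS
    SELECT region, amount FROM sales WHERE dept = 'marketing';
""")

before = dw.execute("SELECT sum(amount) FROM marketing_mart").fetchone()[0]

# New warehouse data is visible through the mart immediately, with no reload.
dw.execute("INSERT INTO sales VALUES ('marketing', 'North', 10.0)")
after = dw.execute("SELECT sum(amount) FROM marketing_mart").fetchone()[0]

print(before, after)
```

Because the mart holds no copy of the data, there is no separate load step to keep it in sync with the warehouse.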
DW vs. Data Marts
Layered Architectures
• Generic Three-Tier Architecture
– Derived data
• Data that has been selected, formatted, and aggregated for
DSS support
– Reconciled data
• Detailed, current data intended to be the single,
authoritative source for all decision support
Layered Architectures
• One-Tier Architecture
– Theoretically possible
– Might be interesting for mobile applications
• N-Tier Architecture
– Higher tier architecture is also possible
• But the complexity grows with the number of tier-interfaces

• Web-based Architecture
– Advantages:
• Usage of existing software, reduction of costs, platform independence
– Disadvantages:
• Security issues: data encryption/user access and identification
Distributed DW
• In most cases the economics and technology
greatly favor a single centralized DW
• But in some cases, a distributed DW makes sense
• Types of distributed DW
– Geographically distributed
• Local DW/global DW
– Technologically distributed DW
• Logically one DW, physically several DWs
– Independently evolving distributed DW
• Uncontrolled growth
Distributed DW
• Geographically distributed
– In the case of corporations spread
around the world
• Information is needed both locally and
globally
– A distributed DW makes sense
• When much processing occurs at the
local level
• Even though local branches report to the
same balance sheet, the local
organizations are their own companies
Distributed DW
• Technologically distributed DW
– Placing the DW on the distributed technology of a vendor
– Advantages
• The entry cost is cheap – large centralized hardware is expensive
• No theoretical limit to how much data can be placed in the DW – one
can add new servers to the network
– As the DW starts to expand, network data communication
starts playing an important role
• Example: For simplicity, consider 4 nodes, each holding the data
for one of the last 4 years
• Now consider a query that needs to access the data from all of
the last 4 years: such a query raises the issue of transporting large
amounts of data between processors
Distributed DW
• Independently evolving distributed DW
– In practice there are many cases in which independent
DWs are developed concurrently and without coordination
in the same organization
• The first step many corporations make is to build a DW for
finance or marketing
• Once it is successfully set up, other parts of the organization
follow the same process independently, which results in several
independent DWs coexisting in the same organization
• This problem will be addressed later
