
2. Data Warehouse (DW)


• A data warehouse (DW) is a collection of data marts
representing historical data from different operations
in the company.
– Flat files, text, numbers, and so on.
– Typically 5 to 10 years of data, in very large volumes.
• This data is stored in a structure optimized for
querying and data analysis as a DW.
• A DW refers to a data repository that is maintained
separately from an organization’s operational
databases.
• Inmon: “A DW is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process.”
2. DW
• Subject-oriented
– DW is organized around major subjects such as
customer, supplier, product, and sales.
– The focus is on analysis for decision makers, not on the day-to-day operations and transaction processing of an organization.
– Data that are not useful in the decision support process are excluded.
• Integrated
– DW constructed by integrating multiple
heterogeneous sources, such as relational
databases, flat files, and online transaction records.
– Data cleaning and data integration techniques ensure consistency in naming conventions and data structures.
2. DW
• Time Variant
– Data are stored to provide information from a historical perspective (e.g., the past 5–10 years).
– The time element is very important in the data sources.
• Nonvolatile
– DW is always a physically separate store of data
transformed from the application data found in the
operational environment.
– DW does not require transaction processing,
recovery, and concurrency control mechanisms.
– It requires only two operations: initial loading of
data and access of data.
2. DW - Data Mart
• It contains a subset of corporate-wide data that is of
value to a specific group of users.
• It may be implemented using specific selected
subjects.
• Data marts are usually implemented on low-cost
departmental servers.
• The data contained in data marts tend to be
summarized.
• The implementation cycle of a data mart is more
likely to be measured in weeks rather than months or
years.
• Depending on the source of data, data marts can be
categorized as independent or dependent.
2. DW - Data Mart
– Independent data marts are sourced from data captured
from one or more operational systems or external
information providers, or from data generated locally
within a particular department or geographic area.
– Dependent data marts are sourced directly from
enterprise data warehouses.
7. Data Cube: A Multidimensional Data Model
• A data cube allows data to be modeled and
viewed in multiple dimensions.
• It is defined by dimensions and facts.
• Dimensions are the perspectives or entities with
respect to which an organization wants to keep
records.
• A multidimensional data model is typically
organized around a central theme, such as sales.
• This theme is represented by a fact table.
• Facts are numeric measures.
7. Data Cube: A Multidimensional Data Model

• 2-D Cube
7. Data Cube: A Multidimensional Data Model
• 4-D Cube
7. Data Cube: A Multidimensional Data Model

• The cuboid that holds the lowest level of summarization is called the base cuboid.
• The 0-D cuboid, which holds the highest level of
summarization, is called the apex cuboid.
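The cuboid lattice can be sketched in Python: for n dimensions there are 2^n cuboids, from the apex cuboid (no grouping) to the base cuboid (grouped by every dimension). The toy fact records below are illustrative, not from the text.

```python
from itertools import combinations
from collections import defaultdict

# Toy fact records over three dimensions, with "sales" as the measure.
facts = [
    {"time": "Q1", "item": "PC", "location": "NY", "sales": 100},
    {"time": "Q1", "item": "TV", "location": "NY", "sales": 50},
    {"time": "Q2", "item": "PC", "location": "LA", "sales": 80},
]
dims = ("time", "item", "location")

def cuboid(group_by):
    """Aggregate total sales over the given subset of dimensions."""
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[d] for d in group_by)
        totals[key] += row["sales"]
    return dict(totals)

# All 2^3 = 8 cuboids, from the apex (no dimensions) to the base (all three).
lattice = [combo for n in range(len(dims) + 1) for combo in combinations(dims, n)]
print(len(lattice))  # 8
print(cuboid(()))    # apex cuboid: {(): 230}
print(cuboid(dims))  # base cuboid: one cell per distinct (time, item, location)
```

The apex cuboid collapses everything into one grand total, while the base cuboid keeps one cell per combination of dimension values.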
End of Introduction to DW
DATA WAREHOUSING
AND
DATA MINING

DW IMPLEMENTATION

Alaa Khalaf Hamoud


Contents
1. Introduction
2. DW Design Considerations
3. DW Design Steps
4. Defining Features
5. Architectural Types
6. Overview of Components
1. Introduction
• Two factors drive you to build and use a
DW: business and technological.
– Business factors: users want to make
decisions quickly and correctly using all
available data.
– Technological factors: the incompatibility
of operational data stores (ODS) must be
addressed, and the IT infrastructure keeps
changing, with capacity increasing and cost
decreasing, which makes building a DW feasible.
Contents
1. Introduction
2. DW Design Considerations
3. DW Design Steps
4. Defining Features
5. Architectural Types
6. Overview of Components
2. DW Design Considerations
1. Business Considerations
2. Data Content
3. Design Considerations
4. Meta Data
5. Data Distribution
6. Tools
7. Performance Considerations
2.1 Business Considerations
• Two approaches to designing a DW:
– Top-down: collect enterprise-wide
business requirements and build an
enterprise DW with subset data marts.
– Bottom-up: build the DW incrementally
by developing and integrating data marts
as and when the requirements become clear.
2.1 Business Considerations
• Top-Down Approach (serves as a systematic solution and
minimizes integration problems).
– However, it is expensive, takes a long time to develop,
and lacks flexibility due to the difficulty in achieving
consistency and consensus for a common data model for
the entire organization.
2.1 Business Considerations
• Bottom-Up Approach (the design, development, and deployment of
independent data marts) provides flexibility, low cost, and rapid return
on investment.
– It can lead to problems when integrating various disparate data marts into a
consistent enterprise DW.
• The data marts are integrated or combined together to form a DW.
– The advantage of this approach is that it does not require high initial
costs and has a faster implementation time.
– This approach is more realistic, but the complexity of the integration may
become a serious obstacle.
2.2 Data Content
• The content and structure of the DW are reflected in its
data model.
• The data model is the template that describes how
information will be organized within the integrated DW
framework.
• The DW data must be detailed data.
• It must be formatted, cleaned up and transformed to fit the
DW data model.

2.3 Design Considerations

• In general, a DW integrates data from multiple heterogeneous
sources into a single query database; this is also one of the
reasons why a DW is difficult to build.
2.4 Meta Data
• It defines the location and contents of data in
the DW.
• Meta data is searchable by users to find
definitions or subject areas.

2.5 Data Distribution


• Data volumes continue to grow;
therefore, it becomes necessary to decide how
the data should be divided across multiple
servers.
• The data can be distributed based on the
subject area, location (geographical region), or
time (current, month, year).
2.6 Tools
• A number of tools are available that are
specifically designed to help in the
implementation of the DW.
• How to choose tool (type, license, support, …)?

2.7 Performance Considerations


• Actual performance levels are environment-dependent
and vary widely from one environment to
another.
• It is relatively difficult to predict the
performance of a typical DW.
Contents
1. Introduction
2. DW Design Considerations
3. DW Design Steps
4. Defining Features
5. Architectural Types
6. Overview of Components
3. DW Design Steps
1. Choose a business process to model (e.g., orders,
invoices, shipments, inventory, account
administration, sales, or the general ledger).
2. Choose the business process grain, which is the
fundamental, atomic level of data to be
represented in the fact table for this process (e.g.,
individual transactions, individual daily snapshots,
and so on).
3. Choose the dimensions that will apply to each fact
table record; typical dimensions are time, item,
customer, supplier, transaction type, and so on.
4. Choose the measures that will populate each fact
table record; typical measures are numeric
additive quantities like dollars sold and units sold.
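The four steps above can be sketched as a minimal star schema, here in Python with sqlite3. The sales process, table names, and columns are illustrative assumptions, not a prescribed design.

```python
import sqlite3

# A toy star schema for a hypothetical "sales" process: one fact table
# keyed to time and item dimension tables, with additive measures.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales (
    time_key INTEGER REFERENCES dim_time(time_key),
    item_key INTEGER REFERENCES dim_item(item_key),
    dollars_sold REAL,
    units_sold INTEGER
);
""")
con.execute("INSERT INTO dim_time VALUES (1, '2024-01-15', '2024-01', 2024)")
con.execute("INSERT INTO dim_item VALUES (1, 'Laptop', 'Computers')")
con.execute("INSERT INTO fact_sales VALUES (1, 1, 1200.0, 2)")

# Additive measures are summed across fact rows, sliced by a dimension attribute.
row = con.execute("""
    SELECT i.category, SUM(f.dollars_sold), SUM(f.units_sold)
    FROM fact_sales f JOIN dim_item i ON f.item_key = i.item_key
    GROUP BY i.category
""").fetchone()
print(row)  # ('Computers', 1200.0, 2)
```

The grain here is one fact row per item per day; coarser or finer grains would change what one fact row represents.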
Contents
1. Introduction
2. DW Design Considerations
3. DW Design Steps
4. Defining Features
5. Architectural Types
6. Overview of Components
4. Defining Features
• Subject-Oriented Data

• Integrated Data
4. Defining Features
• Time-Variant Data

• Nonvolatile Data
4. Defining Features
• Data Granularities
Contents
1. Introduction
2. DW Design Considerations
3. DW Design Steps
4. Defining Features
5. Architectural Types
6. Overview of Components
5. Architectural Types
5.1 Centralized DW
• This architectural type takes into account the
enterprise-level information
requirements.
• Atomic-level normalized data at the
lowest level of granularity (3NF) is stored, and
some summarized data is included.
• Queries and applications access the
normalized data in the central DW.
• There are no separate data marts.
5.2 Independent Data Marts
• This type evolves in companies where the
organizational units develop their own data
marts for their own specific purposes.
• Although each data mart serves the
particular organizational unit, these separate
data marts do not provide “a single version of
the truth.”
• These data marts are independent of one
another and have inconsistent data
definitions and standards.
5.3 Federated
• Some companies get into data warehousing with
an existing legacy of an assortment of decision-
support structures in the form of operational
systems, extracted datasets, primitive data
marts, and so on.
– It may not be prudent to discard all that huge
investment and start from scratch.
– A solution where data may be physically or logically
integrated through shared key fields, overall global
metadata, distributed queries, and such other
methods.
– In this architectural type, there is no one overall DW.
5.4 Hub-and-Spoke
• This is Inmon’s Corporate Information Factory approach.
• Similar to the centralized DW architecture, here too there is an overall
enterprise-wide DW.
• Atomic data in the 3NF is stored in the centralized DW.
• The major and useful difference is the presence of dependent data
marts in this architectural type.
• Dependent data marts obtain data from the centralized data warehouse.
• The centralized DW forms the hub to feed data to the data marts on the
spokes.
• The dependent data marts may be developed for a variety of purposes:
departmental analytical needs ,specialized queries, data mining, and so
on.
• Each dependent data mart may have normalized, denormalized,
summarized, or dimensional data structures based on individual
requirements.
• Most queries are directed to the dependent data marts although the
centralized DW may itself be used for querying.
• This architectural type results from adopting a top-down approach to
DW development.
5.5 Data-Mart Bus
• This is the Kimball conformed super marts approach.
• You begin by analyzing requirements for a specific
business subject such as orders, shipments, billings,
insurance claims, car rentals, and so on.
• You build the first data mart (super mart) using business
dimensions and metrics.
• These business dimensions will be shared in the future
data marts.
• The principal notion is that by conforming dimensions
among the various data marts, the result would be
logically integrated super marts that will provide an
enterprise view of the data.
• The data marts contain atomic data organized as a
dimensional data model.
• This architectural type results from adopting an enhanced
bottom-up approach to DW development.
Contents
1. Introduction
2. DW Design Considerations
3. DW Design Steps
4. Defining Features
5. Architectural Types
6. Overview of Components
6. Overview of Components
6. Overview of Components
1. Source Data Components
a) Production Data
b) Internal Data
c) Archived Data
d) External Data
2. Data Staging Component (ETL, Staging Area)
3. Data Storage Component
4. Information Delivery Component
5. Metadata Component
6.1 Source Data Components
• Production Data: comes from various
operational systems of the enterprise
– Financial systems, manufacturing systems,
systems along the supply chain, and customer
relationship management systems.
• Based on the information requirements of
the DW, you choose segments of data from
the different operational systems.
• There are many variations in data formats
across the different hardware platforms and
operating systems.
6.1 Source Data Components
• Internal Data: In every organization, users keep their “private”
spreadsheets, documents, customer profiles, and sometimes even
departmental databases, parts of which could be useful in a DW.
• You cannot ignore the internal data held in private files in your
organization.
• The size of the internal data that should be included in the DW
adds more complexity.
• Internal data adds additional complexity to the process of
transforming and integrating the data before it can be stored in
the DW.
• In every operational system, you periodically take the old data
and store it in archived files.
• Sometimes data is left in the operational system databases for as
long as five years.
• Much of the archived data comes from old legacy systems that
are nearing the end of their useful lives in organizations.
6.1 Source Data Components
• Archived Data: Many different methods of archiving
exist, including staged archival methods.
• At the first stage, recent data is archived to a
separate archival database that may still be online.
• The older data is archived to flat files on disk storage
or tape cartridges or microfilm and even kept off-site.
• A DW keeps historical snapshots of data; you
essentially need historical data for analysis over time.
For historical information, you look into your
archived data sets.
• Depending on your DW requirements, you have to
include sufficient historical data. This type of data is
useful for discerning patterns and analyzing trends.
6.1 Source Data Components
• External Data: Most executives depend on
data from external sources for a high
percentage of the information they use.
• They use statistics relating to their industry
produced by external agencies and national
statistical offices.
– Market share data of competitors.
– Standard values of financial indicators for their
business to check on their performance.
6.2 Data Staging Component
• After the extraction process, you have to prepare
the data for storage in the DW.
• The extracted data coming from several
disparate sources needs to be changed,
converted, and made ready in a format that is
suitable to be stored for querying and analysis.
• Three major functions need to be performed for
getting the data ready (ETL).
• Data staging provides a place and an area with a
set of functions to clean, change, combine,
convert, de-duplicate, and prepare source data
for storage and use in the DW.
6.2 Data Staging Component
• In extraction, source data may be from different
source machines in diverse data formats and may
include data from spreadsheets and local
departmental data sets.
– Tools are available on the market for data extraction
(in-house programs or outside tools may entail high initial
costs).
• Data extraction and data transformation present
greater challenges.
– First, you clean the data extracted from each source.
– Standardization of data elements forms a large part of
data transformation.
– Data transformation involves combining processes; you
combine data from a single source record or related data
elements from many source records.
– Sorting and merging of data takes place on a large scale
in the data staging area.
6.2 Data Staging Component
• For loading, two distinct groups of tasks form
the data loading function (initial load and
refresh)
• When you complete the design and construction
of the DW and go live for the first time, you do
the initial loading of the data into the DW
storage.
• The initial load moves large volumes of data
using up substantial amounts of time.
• As the DW starts functioning, you continue to
extract the changes to the source data,
transform the data revisions, and feed the
incremental data revisions on an ongoing basis.
• A refresh may be yearly, quarterly, monthly, or
daily.
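A minimal sketch of the two loading tasks, using a plain dictionary as a stand-in for the warehouse store; the records are illustrative.

```python
# Source system state at the time of the initial load.
source = {1: "Alice", 2: "Bob"}

# Initial load: move the entire source content into the warehouse store.
warehouse = dict(source)

# Later, the source changes; only the revisions captured since the last
# extraction are fed to the warehouse as an incremental refresh.
changes = {2: "Bobby", 3: "Carol"}   # one update and one insert
warehouse.update(changes)

print(warehouse)  # {1: 'Alice', 2: 'Bobby', 3: 'Carol'}
```

The initial load moves everything once; each refresh applies only the delta, which is why refresh jobs are much smaller than the initial load.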
6.3 Data Storage Component
• Data storage is kept in separate repositories
for operational systems and the DW.
• The operational systems typically contain only
the current data.
– Data repositories contain the data structured in
highly normalized formats for fast and efficient
processing.
• You need to keep large volumes of historical
data for analysis.
– The data in the DW is kept in structures
suitable for analysis, not for quick retrieval of
individual pieces of information.
• The data storage for the DW is kept separate
from the data storage for operational systems.
6.3 Data Storage Component
• Data in operational systems
– Support update data during transactions
– Could change from moment to moment
• When analysts use the data in the DW for analysis, the data
should be:
– Stable and represents snapshots at specified periods.
– Data storage must not be in a state of continual updating.
• For this reason, the DWs are “read-only” data
repositories.
• The source database and DW must be open to different
tools.
– Relational Database Management Systems (RDBMS)
– Multidimensional Database Management Systems (MDBMS).
• Data extracted from the DW storage is aggregated in many
ways, and the summary data is kept in the MDBMS.
6.4 Information Delivery Component
• A novice user needs prefabricated reports and
preset queries.
• A casual user needs information once in a while,
not regularly (prepackaged information).
• The business analyst looks for the ability to do
complex analysis using the information in the
DW.
• The power user wants to be able to navigate
throughout the DW, pick up interesting data,
format his or her own queries, drill through the
data layers, and create custom reports and ad
hoc queries.
6.4 Information Delivery Component
• Different methods of information delivery are required for DW users.
– Ad hoc reports are predefined reports primarily meant for novice and casual
users.
– Provision for complex queries, multidimensional (MD) analysis, and
statistical analysis cater to the needs of the business analysts and power
users.
– Information fed into executive information systems (EIS) is meant for senior
executives and high-level managers.
– Some DW also provide data to data mining applications.
• Data mining applications are knowledge discovery systems where the
mining algorithms help you discover trends and patterns from the usage
of your data.
• In your DW, you may include several information delivery mechanisms
(online queries and reports).
• The users will enter their requests online and will receive the results
online.
• You may set up delivery of scheduled reports through e-mail or you may
make adequate use of your organization’s intranet for information
delivery.
• Recently, information delivery over the Internet has been gaining
ground.
6.5 Metadata Component
• Metadata in a DW is similar to the data
dictionary or the data catalog in a
database management system (but
much more than a data dictionary).
• In the data dictionary, you keep the
information about the logical data
structures, the information about the
files and addresses, the information
about the indexes, and so on.
End of DW Implementation
DATA WAREHOUSING AND DATA MINING
IS403

ETL- PART 1

Alaa Khalaf Hamoud


Contents
1. Introduction
2. ETL Requirements
3. Data Extraction
3.1 Source Identification
3.2 Data Extraction Techniques
4. Data Transformation
1. Introduction
• Where are the ETL functions located?
1. Introduction
• To turn data into information, you first need
to capture the data.
• You cannot simply dump that data into the
DW and call it strategic information.
• You have to subject the extracted data to all
manner of transformations.
• You must perform all three functions of ETL
for successfully transforming data into
strategic information or business
intelligence.
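The three functions can be sketched as a toy pipeline in Python; the field names and transformations are illustrative assumptions.

```python
# Extract: pull raw rows from a source (here, a hard-coded stand-in).
def extract():
    return [{"name": " alice ", "amount": "100"}]

# Transform: clean and convert the raw values into usable form.
def transform(rows):
    return [{"name": r["name"].strip().title(), "amount": int(r["amount"])}
            for r in rows]

# Load: apply the transformed rows to the warehouse store.
def load(rows, target):
    target.extend(rows)

dw = []
load(transform(extract()), dw)
print(dw)  # [{'name': 'Alice', 'amount': 100}]
```

Each stage is kept separate so that sources, cleaning rules, and targets can change independently.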
1. Introduction
• ETL functions are challenging primarily because
of the nature of the source systems.
• The source systems are:
– Very diverse and disparate, on multiple
platforms and different operating systems.
– Many of them are older legacy applications
running on obsolete database technologies.
– Generally, historical data on changes in
values is not preserved in source
operational systems, yet historical information is
critical in a DW.
– Quality of data is dubious in many old source
systems that have evolved over time.
1. Introduction
– Source system structures keep changing over
time because of new business conditions. ETL
functions must also be modified accordingly.
– Gross lack of consistency among source
systems is prevalent.
• The same data is likely to be represented differently
in the various source systems.
– Most source systems do not represent data in
types or formats that are meaningful to the
users; many representations are cryptic and
ambiguous.
Contents
1. Introduction
2. ETL Requirements
3. Data Extraction
3.1 Source Identification
3.2 Data Extraction Techniques
4. Data Transformation
2. ETL Requirements
• The primary reason for the complexity of ETL
functions is the tremendous diversity of the
source systems.
• For initial bulk refresh as well as for the
incremental data loads, the sequence is:
– Triggering for incremental changes,
– Filtering for refreshes and incremental loads,
– Data extraction,
– Data transformation,
– Integration, cleansing, and applying to the DW
DBMS.
2. ETL Requirements
• In a large enterprise, we could have a
bewildering combination of computing
platforms, operating systems, database
management systems, network protocols,
and source legacy systems.
• Usually, the refreshes, whether for initial
load or for periodic refreshes, cause
difficulties, not so much because of
complexities, but because these load jobs
run too long.
2. ETL Requirements
Contents
1. Introduction
2. ETL Requirements
3. Data Extraction
3.1 Source Identification
3.2 Data Extraction Techniques
4. Data Transformation
3. Data Extraction
• Some data may be in other legacy network or
hierarchical data models, or may still be in
separate flat files.
• You may want to include data from spreadsheets
and local departmental data sets.
• Tools for extraction:
– You may use outside tools (market) suitable for
certain data sources (entail high initial costs).
– You may want to develop in-house programs for
other data sources (ongoing costs for
development and maintenance).
3. Data Extraction
• Two major factors differentiate data extraction
for a DW from data extraction for a new
operational system.
– For a DW, you have to extract data from many
disparate sources. Next, you have to extract data
on the changes for ongoing incremental loads as
well as for a one-time initial full load.
– For operational systems, all you need is one-time
extractions and data conversions.
• Effective data extraction is a key to the
success of your DW.
– Pay special attention to the issues and formulate
a data extraction strategy for your DW.
3. Data Extraction
• Here is a list of data extraction issues:
1. Source identification: identify source
applications and source structures.
2. Method of extraction: for each data
source, define whether the extraction
process is manual or tool-based.
3. Extraction frequency: for each data
source, establish how frequently the
data extraction must be done: daily,
weekly, quarterly, and so on.
3. Data Extraction
4. Time window: for each data source,
denote the time window for the
extraction process.
5. Job sequencing: determine whether the
beginning of one job in an extraction job
stream has to wait until the previous job
has finished successfully.
6. Exception handling: determine how to
handle input records that cannot be
extracted.
3.1 Source Identification
• Encompasses the identification of all
the proper data sources.
• It includes examination and verification
that the identified sources will provide
the necessary value to the DW.
– Determine if the source systems have data
needed for this data mart.
– Then, you have to establish the correct
data source for each data element in the
data mart.
3.2 Data Extraction Techniques
• You should understand the nature of the source
data before examining extraction techniques.
• Business transactions keep changing the data
in the source systems.
• In most cases, the value of an attribute in a
source system is the value of that attribute at
that time.
• If you look at every data structure in the
source operational systems, the day-to-day
business transactions constantly change the
values of the attributes in these structures.
3.2 Data Extraction Techniques
• When a customer moves to another
state, the data about that customer
changes in the customer table in the
source system.
• Data in the source systems are said to be
time-dependent or temporal.
• This is because source data changes
with time. The value of a single variable
varies over time.
3.2 Data Extraction Techniques
• Operational data falls into two broad
categories, depending on its nature:
– Current Value: here the stored value of an
attribute represents the value of the attribute at
this moment of time.
– Periodic Status: in this category, the value of the
attribute is preserved as the status every time a
change occurs. At each of these points in time,
the status value is stored with reference to the
time when the new value became effective. This
category also includes events stored with
reference to the time when each event occurred.
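A small Python sketch of the two categories: a current-value store overwrites history, while a periodic-status store keeps each value with the date it became effective. The customer data and dates are illustrative.

```python
# Current value: the store keeps only the latest state; history is lost.
current = {"C1": "NY"}
current["C1"] = "LA"          # the customer's move overwrites the old address

# Periodic status: each change is appended with its effective date,
# so the value at any point in time can be reconstructed.
history = [("C1", "NY", "2023-01-01"), ("C1", "LA", "2024-06-01")]

def address_on(cust, date):
    """Return the address in effect for the customer on the given date."""
    rows = [r for r in history if r[0] == cust and r[2] <= date]
    return max(rows, key=lambda r: r[2])[1]

print(address_on("C1", "2023-12-31"))  # NY
print(address_on("C1", "2024-07-01"))  # LA
```

Only the periodic-status form can answer "where did this customer live last year?", which is why the time element matters so much for DW sources.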
3.2 Data Extraction Techniques
• Static data will be used for the initial load of
the DW.
• Sometimes, you may want a full refresh of a
dimension table.
– For example, assume that the product master of
your source application is completely revamped.
In this case, you may find it easier to do a full
refresh of the product dimension table of the
target DW.
• Capture of data revisions is also known as
incremental data capture.
3.2 Data Extraction Techniques
• Strictly, it is not incremental data but the
revisions since the last time data was
captured.
• If the source data is transient, the capture of
the revisions is not easy.
• For periodic status data or periodic event
data, the incremental data capture includes
the values of attributes at specific times.
• Extract the statuses and events that have
been recorded since the last data extraction.
3.2 Data Extraction Techniques
• Capture through Transaction Logs
– Use the transaction logs of DBMSs maintained
for recovery from possible failures.
– A transaction log entry is written whenever a transaction
adds, updates, or deletes a row from a database table.
– This data extraction technique reads the
transaction log and selects all the committed
transactions.
– There is no extra overhead in the operational
systems because logging is already part of the
transaction processing.
3.2 Data Extraction Techniques
• Capture through Database Triggers
– Triggers are special stored procedures
(programs) that are stored on the database and
fired when certain predefined events occur.
– You can create trigger programs for all events for
which you need data to be captured (to capture
all changes to the records in the customer
table).
– The output of the trigger programs is written to a
separate file that will be used to extract data for
the DW.
– Applicable to database applications.
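Trigger-based capture can be sketched with sqlite3, which supports AFTER UPDATE triggers; the table and trigger names are illustrative.

```python
import sqlite3

# An AFTER UPDATE trigger copies each change on the customer table into
# a separate change-log table that the warehouse extract reads later.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE customer_changes (id INTEGER, old_city TEXT, new_city TEXT);
CREATE TRIGGER capture_update AFTER UPDATE ON customer
BEGIN
    INSERT INTO customer_changes VALUES (OLD.id, OLD.city, NEW.city);
END;
""")
con.execute("INSERT INTO customer VALUES (1, 'NY')")
con.execute("UPDATE customer SET city = 'LA' WHERE id = 1")

print(con.execute("SELECT * FROM customer_changes").fetchall())
# [(1, 'NY', 'LA')]
```

The application code needs no changes: the database fires the trigger itself, which is the appeal of this technique for database applications.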
3.2 Data Extraction Techniques
• Capture in Source Applications
– Referred to as application assisted data capture.
– The source application is made to assist in the
data capture for the DW. You have to modify the
relevant application programs that write to the
source files and databases.
– You revise the programs to write all adds,
updates, and deletes to the source files and
database tables. Then other extract programs
can use the separate file containing the changes
to the source data.
3.2 Data Extraction Techniques
• Immediate Data Extraction
– In this option, the data extraction is real-time.
– It occurs as the transactions happen at the
source databases and files.
• Deferred Data Extraction
– In the cases discussed before, data capture
takes place while the transactions occur in the
source operational systems. The data capture
is immediate or real-time.
– The techniques under deferred data
extraction do not capture the changes in real
time. The capture happens later.
3.2 Data Extraction Techniques
• Capture Based on Date and Time Stamp
– Every time a source record is created or updated
it may be marked with a stamp showing the date
and time.
– The time stamp provides the basis for selecting
records for data extraction.
– The relevant source records should contain date
and time stamps.
– Here the data capture occurs at a later time, not
while each source record is created or updated.
3.2 Data Extraction Techniques
• Capture Based on Date and Time Stamp
– This technique works well if the number of
revised records is small.
– This technique can work for any type of
source file.
– This technique captures the latest state of
the source data.
– Any intermediary states between two data
extraction runs are lost.
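A sketch of time-stamp-based capture in Python: records carrying a last-updated stamp are filtered against the time of the previous extraction run. The data is illustrative.

```python
from datetime import date

# Each source record carries a last-updated stamp; the extract selects
# only records touched since the previous run. Note that any intermediate
# states between two runs are not seen -- only the latest state is captured.
records = [
    {"id": 1, "city": "NY", "updated": date(2024, 1, 10)},
    {"id": 2, "city": "LA", "updated": date(2024, 3, 5)},
]
last_extraction = date(2024, 2, 1)

delta = [r for r in records if r["updated"] > last_extraction]
print([r["id"] for r in delta])  # [2]
```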
3.2 Data Extraction Techniques
• Capture by Comparing Files
– This technique is also called the snapshot
differential technique because it compares two
snapshots of the source data.
– To apply this technique, while performing today’s
data extraction for changes to product data, you
do a full file comparison between today’s copy of
the product data and yesterday’s copy.
– You also compare the record keys to find the
inserts and deletes. Then you capture any
changes between the two copies.
3.2 Data Extraction Techniques
• Capture by Comparing Files
– This technique necessitates the keeping of
prior copies of all the relevant source
data.
– Not effective for large files.
– It is considered the only feasible option for
some legacy data sources with no
transaction logs or time stamps on source
records.
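The snapshot differential can be sketched with two key-indexed copies of the data; set operations on the record keys yield the inserts, deletes, and updates. The product data is illustrative.

```python
# Yesterday's and today's full snapshots of the product data, keyed by
# the record key.
yesterday = {"P1": "Laptop", "P2": "Tablet"}
today     = {"P1": "Laptop Pro", "P3": "Phone"}

inserts = {k: today[k] for k in today.keys() - yesterday.keys()}   # new keys
deletes = yesterday.keys() - today.keys()                          # vanished keys
updates = {k: today[k] for k in today.keys() & yesterday.keys()    # changed rows
           if today[k] != yesterday[k]}

print(inserts, deletes, updates)
# {'P3': 'Phone'} {'P2'} {'P1': 'Laptop Pro'}
```

The cost is clear from the code: the full prior snapshot must be kept and every record compared, which is why the technique scales poorly to large files.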
Contents
1. Introduction
2. ETL Requirements
3. Data Extraction
3.1 Source Identification
3.2 Data Extraction Techniques
4. Data Transformation
4. Data Transformation
• Data in the operational systems is not usable
for DW purposes (quality problems, inconsistencies, and so on).
• First, all the extracted data must be made
usable in the DW.
• Having information that is usable for
strategic decision making is the underlying
principle of the DW.
• You have to enrich and improve the quality
of the data before it can be usable in the DW.
• Various kinds of data transformations should
be applied into extracted data.
4. Data Transformation
• Transformations must be applied according to the standards
of the source systems.
– A wide variety of manipulations is needed to change all
the extracted source data into usable
information.
– After the data is put together, the combined data
must not violate business rules.
– Data formats, data values, and the condition of
the data quality should be considered.
• Due to transformation complexity, many
organizations start out with a simple departmental
data mart as the pilot project.
4. Data Transformation
• Data integration may refer to:
– the preprocessing of data within the data
transformation function, or
– the mapping of the source fields to the target fields
in the DW.
• One major effort within data transformation is
the improvement of data quality (filling in the
missing values for attributes in the extracted
data).
– Data quality is of paramount importance in the DW
because the effect of strategic decisions based on
incorrect information can be devastating.
4. Data Transformation
• First, you clean the data extracted from each
source.
• Cleaning may involve:
– correction of misspellings,
– resolution of conflicts,
– providing default values for missing data
elements,
– elimination of duplicates when you bring
in the same data from multiple source
systems.
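A toy sketch of two of these cleaning steps in Python, assuming records keyed by an id field; the default value and data are illustrative.

```python
# Raw extracted records: one has a missing value, one is a duplicate
# brought in from a second source system.
raw = [
    {"id": 1, "country": "US"},
    {"id": 2, "country": None},   # missing data element
    {"id": 1, "country": "US"},   # duplicate from another source
]

seen, clean = set(), []
for rec in raw:
    if rec["id"] in seen:
        continue                  # eliminate the duplicate record
    seen.add(rec["id"])
    rec["country"] = rec["country"] or "UNKNOWN"  # provide a default value
    clean.append(rec)

print(clean)
# [{'id': 1, 'country': 'US'}, {'id': 2, 'country': 'UNKNOWN'}]
```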
4. Data Transformation
• Standardization of data elements forms a
large part of data transformation.
• You standardize the data types and field
lengths for same data elements retrieved
from the various sources.
• Semantic standardization is another major
task. You resolve synonyms and homonyms.
– When two or more terms from different source
systems mean the same thing, you resolve the
synonyms.
– When a single term means many different things in
different source systems, you resolve the homonym.
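Synonym resolution can be sketched as a simple mapping from each source term to one standard warehouse term; the terms below are illustrative.

```python
# Different source systems use different terms for the same entity;
# the mapping standardizes them to one warehouse term.
synonyms = {"cust": "customer", "client": "customer", "customer": "customer"}

source_a_field = "cust"     # term used by source system A
source_b_field = "client"   # term used by source system B
print(synonyms[source_a_field] == synonyms[source_b_field])  # True
```

Homonyms go the other way: one term with several meanings must be split into distinct warehouse terms, typically by qualifying it with its source context.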
4. Data Transformation
• Data transformation involves combining
processes.
– Combining data from single source record or related
data elements from many source records.
• Sorting and merging of data takes place on a
large scale in the data staging area.
• In many cases, the keys chosen for the operational
systems are field values with built-in meanings.
– For example, the product key value may be a
combination of characters indicating the product
category, the code of the warehouse where the
product is stored, and some code to show the
production batch.
• Primary keys in the DW cannot have built-in
meanings.
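A sketch of replacing a meaningful operational key with a meaningless surrogate key; the key format follows the illustrative product-key example above (category, warehouse, batch), and is hypothetical.

```python
from itertools import count

# The operational key 'PC-WH3-B7' carries built-in meaning (product
# category, warehouse code, production batch); the warehouse assigns a
# meaningless surrogate key instead and keeps the operational key as an
# ordinary attribute.
next_key = count(1)
key_map = {}

def surrogate(operational_key):
    """Return a stable, meaning-free surrogate key for the operational key."""
    if operational_key not in key_map:
        key_map[operational_key] = next(next_key)
    return key_map[operational_key]

print(surrogate("PC-WH3-B7"))  # 1
print(surrogate("TV-WH1-A2"))  # 2
print(surrogate("PC-WH3-B7"))  # 1  (same product maps to the same surrogate)
```

Because the surrogate carries no meaning, a product can change category or warehouse without its warehouse key ever changing.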
End of ETL-1
