
Bahria University

Lahore Campus

Assignment # 5
Name: Aqsa Gulzar

Enrollment No: 03-134171-005

Program: BSCS(7A)

Semester: 7

Course title: Data Warehouse

Instructor Name: Sir Junaid

Date assigned: 07/03/2020

Date of submission: 24/04/2020


Q#1. Define the process of ETL (Extract, Transform, Load), its different tools and methodologies.

Ans: ETL Process


ETL is the process by which data is extracted from data sources (that are not optimized for
analytics) and moved to a central host. The exact steps in that process might differ from one
ETL tool to the next, but the end result is the same.
At its most basic, the ETL process encompasses data extraction, transformation, and loading.
While the abbreviation implies a neat, three-step process – extract, transform, load – this
simple definition doesn’t capture:
 The transportation of data
 The overlap between each of these stages
 How new technologies are changing this flow

Traditional ETL process

Step 1: Extraction
In this step, data is extracted from the source system into the staging area. Transformations, if any, are done in the staging area so that the performance of the source system is not degraded. Also, if corrupted data is copied directly from the source into the Data warehouse database, rollback will be a challenge. The staging area gives an opportunity to validate extracted data before it moves into the Data warehouse.
Three Data Extraction methods:
1. Full Extraction
2. Partial Extraction- without update notification.
3. Partial Extraction- with update notification
Irrespective of the method used, extraction should not affect the performance and response time of the source systems. These source systems are live production databases; any slowdown or locking could affect the company's bottom line.
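For illustration, here is a minimal sketch of a partial (incremental) extraction in Python. The orders table, its updated_at column, and the watermark value are hypothetical, and sqlite3 merely stands in for the live source system:

```python
import sqlite3

# Illustrative source table; in practice this is the live production database.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, 120.0, "2020-04-20"), (2, 75.5, "2020-04-23")])

# Watermark remembered from the previous extraction run (hypothetical).
last_extracted_at = "2020-04-21"

# Partial extraction with update notification: pull only the rows changed since the
# watermark, so the query touches as few rows as possible on the live source system.
staged_rows = source.execute(
    "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?",
    (last_extracted_at,),
).fetchall()

print(staged_rows)  # rows land in the staging area for validation before loading
```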

Step 2: Transformation
Data extracted from the source server is raw and not usable in its original form. Therefore, it needs to be cleansed, mapped and transformed. In fact, this is the key step where the ETL process adds value and changes the data so that insightful BI reports can be generated. In this step, you apply a set of functions to the extracted data. Data that does not require any transformation is called direct move or pass-through data. In the transformation step, you can perform customized operations on the data. For instance, the user may want a sum-of-sales revenue figure that is not in the database, or the first name and the last name in a table may sit in different columns; it is possible to concatenate them before loading.
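A small Python sketch of the two transformations just mentioned; the table, column names and data are illustrative rather than taken from any particular tool:

```python
import pandas as pd

# Raw extract staged from the source system (illustrative data).
staged = pd.DataFrame({
    "first_name": ["Aqsa", "Ali"],
    "last_name":  ["Gulzar", "Khan"],
    "region":     ["Lahore", "Karachi"],
    "sales":      [1200.0, 800.0],
})

# Transformation 1: concatenate first name and last name into a single column.
staged["full_name"] = staged["first_name"] + " " + staged["last_name"]

# Transformation 2: derive sum-of-sales revenue per region, which is not stored in the source.
sales_by_region = staged.groupby("region", as_index=False)["sales"].sum()

print(staged[["full_name", "region", "sales"]])
print(sales_by_region)
```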

Step 3: Loading

Loading data into the target data warehouse database is the last step of the ETL process. In a typical Data warehouse, a huge volume of data needs to be loaded in a relatively short period (nights). Hence, the load process should be optimized for performance.

In case of load failure, recovery mechanisms should be configured to restart from the point of failure without loss of data integrity. Data Warehouse admins need to monitor, resume, or cancel loads as per prevailing server performance.

Types of Loading:

 Initial Load — populating all the Data Warehouse tables


 Incremental Load — applying ongoing changes periodically, as and when needed.
 Full Refresh — erasing the contents of one or more tables and reloading them with fresh data (see the sketch below).
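A minimal sketch contrasting an incremental load with a full refresh against a hypothetical sales_fact table; sqlite3 is used purely for illustration, and the table and column names are assumptions:

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales_fact (order_id INTEGER PRIMARY KEY, amount REAL)")

def full_refresh(rows):
    # Full Refresh: erase the table contents and reload with fresh data.
    warehouse.execute("DELETE FROM sales_fact")
    warehouse.executemany("INSERT INTO sales_fact VALUES (?, ?)", rows)
    warehouse.commit()

def incremental_load(changed_rows):
    # Incremental Load: apply only the ongoing changes (insert new rows, update changed ones).
    warehouse.executemany(
        "INSERT INTO sales_fact VALUES (?, ?) "
        "ON CONFLICT(order_id) DO UPDATE SET amount = excluded.amount",
        changed_rows,
    )
    warehouse.commit()

full_refresh([(1, 120.0), (2, 75.5)])
incremental_load([(2, 80.0), (3, 200.0)])
print(warehouse.execute("SELECT * FROM sales_fact ORDER BY order_id").fetchall())
```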

ETL tools

Here are 8 of the best ETL software tools for 2020 and beyond:
1. Improvado
2. AWS Glue
3. Xplenty
4. Alooma
5. Talend
6. Stitch
7. Informatica PowerCenter
8. Oracle Data Integrator

Q#2. Explain dimensional modelling and its usage.


Ans: Dimensional Modeling
Dimensional Modeling (DM) is a data structure technique optimized for data storage in a Data warehouse. The purpose of the dimensional model is to optimize the database for fast retrieval of data. The concept of Dimensional Modelling was developed by Ralph Kimball and consists of "fact" and "dimension" tables.
A Dimensional model is designed to read, summarize and analyze numeric information like values, balances, counts, weights, etc. in a data warehouse. In contrast, relational models are optimized for the addition, updating and deletion of data in a real-time Online Transaction Processing system.

Elements of Dimensional Data Model

Fact

Facts are the measurements/metrics from your business process. For a Sales business process, a measurement would be the quarterly sales figure.

Dimension

A dimension provides the context surrounding a business process event. In simple terms, dimensions give the who, what and where of a fact.

Attributes

The Attributes are the various characteristics of the dimension.

In the Location dimension, the attributes can be

 State
 Country
 Zipcode etc.

Attributes are used to search, filter, or classify facts. Dimension tables contain attributes.

Fact Table

A fact table is a primary table in a dimensional model.

A Fact Table contains

1. Measurements/facts
2. Foreign key to dimension table

Dimension table

 A dimension table contains the dimensions of a fact.
 Dimension tables are joined to the fact table via a foreign key.
 Dimension tables are de-normalized tables.
 The dimension attributes are the various columns in a dimension table.
 Dimensions offer descriptive characteristics of the facts with the help of their attributes.
 There is no set limit on the number of dimensions.
 A dimension can also contain one or more hierarchical relationships.
Steps of Dimensional Modelling

The accuracy in creating your dimensional model determines the success of your data warehouse implementation. Here are the steps to create a Dimension Model:

1. Identify Business Process


2. Identify Grain (level of detail)
3. Identify Dimensions
4. Identify Facts
5. Build Star

Step 1: Identify the business process

Identify the actual business process the data warehouse should cover. This could be Marketing, Sales, HR, etc., as per the data analysis needs of the organization.

Step 2: Identify the grain

The grain describes the level of detail for the business problem/solution. It is the process of identifying the lowest level of information for any table in your data warehouse. If a table contains sales data for every day, then it has daily granularity. If a table contains total sales data for each month, then it has monthly granularity.
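For example, the same sales data can be held at daily grain or rolled up to monthly grain, losing the per-day detail; a small pandas sketch with made-up figures:

```python
import pandas as pd

# Daily-grain table: one row per day.
daily = pd.DataFrame({
    "date":  pd.to_datetime(["2020-03-01", "2020-03-02", "2020-04-01"]),
    "sales": [100.0, 150.0, 90.0],
})

# Rolling the same data up to monthly grain keeps only one row per month.
monthly = (daily
           .assign(month=daily["date"].dt.to_period("M"))
           .groupby("month", as_index=False)["sales"].sum())

print(monthly)
```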

Step 3: Identify the dimensions


Dimensions are nouns like date, store, inventory, etc. These dimensions are where all the data should be stored. For example, the date dimension may contain data like year, month and weekday.

Step 4: Identify the Fact

This step is co-associated with the business users of the system because this is where they get
access to data stored in the data warehouse. Most of the fact table rows are numerical values
like price or cost per unit, etc.

Step 5: Build Schema

In this step, you implement the Dimension Model. A schema is nothing but the database structure (the arrangement of tables). There are two popular schemas:

Star Schema

The star schema architecture is easy to design. It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star consists of the fact table, and the points of the star are the dimension tables.

The fact table in a star schema is in third normal form, whereas the dimension tables are de-normalized.
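A minimal star-schema sketch (the table and column names are illustrative): the central fact table holds only foreign keys and a sales measure, and queries join out to the de-normalized dimension tables:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_date   (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_store  (store_key INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE fact_sales (date_key INTEGER, store_key INTEGER, sales_amount REAL);
INSERT INTO dim_date   VALUES (20200301, 2020, 3);
INSERT INTO dim_store  VALUES (1, 'Lahore');
INSERT INTO fact_sales VALUES (20200301, 1, 120.0);
""")

# The points of the star (dimension tables) join to the central fact table by foreign keys.
rows = db.execute("""
    SELECT d.year, d.month, s.city, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_date  d ON f.date_key  = d.date_key
    JOIN dim_store s ON f.store_key = s.store_key
    GROUP BY d.year, d.month, s.city
""").fetchall()
print(rows)
```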

Snowflake Schema

The snowflake schema is an extension of the star schema. In a snowflake schema, each dimension is normalized and connected to additional dimension tables.

Rules for Dimensional Modelling

 Load atomic data into dimensional structures.
 Build dimensional models around business processes.
 Ensure that every fact table has an associated date dimension table.
 Ensure that all facts in a single fact table are at the same grain or level of detail.
 It is essential to store report labels and filter domain values in dimension tables.
 Ensure that dimension tables use a surrogate key.
 Continuously balance requirements and realities to deliver a business solution that supports decision-making.

Benefits of dimensional modeling

 Standardization of dimensions allows easy reporting across areas of the business.
 Dimension tables store the history of the dimensional information.
 It allows an entirely new dimension to be introduced without major disruption to the fact table.
 The dimensional model also stores data in such a fashion that it is easier to retrieve the information once it is stored in the database.
 Compared to the normalized model, dimensional tables are easier to understand.
 Information is grouped into clear and simple business categories.
 The dimensional model is very understandable by the business. It is based on business terms, so the business knows what each fact, dimension, or attribute means.
 Dimensional models are denormalized and optimized for fast data querying. Many relational database platforms recognize this model and optimize query execution plans to aid performance.
 Dimensional modeling creates a schema that is optimized for high performance. It means fewer joins and minimized data redundancy.
 The dimensional model also helps boost query performance; being more denormalized, it is optimized for querying.
 Dimensional models can comfortably accommodate change. Dimension tables can have more columns added to them without affecting existing business intelligence applications that use these tables.

Q#3. Explain, with the help of an example, the process of de-normalization.


Ans: Denormalization in Databases
Denormalization is a database optimization technique in which we add redundant data to one
or more tables. This can help us avoid costly joins in a relational database. Note that
denormalization does not mean not doing normalization. It is an optimization technique that
is applied after doing normalization.
In a traditional normalized database, we store data in separate logical tables and attempt to minimize redundant data. We may strive to have only one copy of each piece of data in the database.

Example
Imagine that users of our email messaging service want to access messages by category.
Keeping the name of a category right in the User_messages table can save time and reduce
the number of necessary joins.

In the denormalized table above, we introduced the category_name column to store information about which category each record in the User_messages table is related to. Thanks to denormalization, only a query on the User_messages table is required to enable a user to select all messages belonging to a specific category. Of course, this denormalization technique has a downside: the extra column may require a lot of storage space.
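A small Python sketch of that example, assuming hypothetical Categories and User_messages tables; the denormalized copy carries category_name, so a single-table filter answers the query without a join:

```python
import pandas as pd

# Normalized design: the category name lives only in the Categories table.
categories = pd.DataFrame({"category_id": [1, 2],
                           "category_name": ["Work", "Personal"]})
user_messages = pd.DataFrame({"message_id": [10, 11, 12],
                              "category_id": [1, 2, 1],
                              "subject": ["Report", "Lunch", "Budget"]})

# Normalized query: a join is required to filter messages by category name.
joined = user_messages.merge(categories, on="category_id")
work_messages = joined[joined["category_name"] == "Work"]

# Denormalized design: category_name is copied into User_messages (extra storage),
# so the same question is answered by a query on one table only.
user_messages_denormalized = joined
work_messages_no_join = user_messages_denormalized[
    user_messages_denormalized["category_name"] == "Work"
]

print(work_messages_no_join[["message_id", "subject", "category_name"]])
```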

Q#4. What is the difference between OLAP, MOLAP, ROLAP and HOLAP?
Ans:
OLAP: OLAP stands for Online Analytical Processing. It is a computing method that enables users to easily and selectively extract and query data in order to analyze it from different points of view. OLAP business intelligence queries often aid in trend analysis, financial reporting, sales forecasting, budgeting and other planning purposes.

ROLAP: ROLAP stands for Relational Online Analytical Processing. This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement.

MOLAP: MOLAP stands for Multidimensional Online Analytical Processing. This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.

HOLAP: HOLAP stands for Hybrid Online Analytical Processing. HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data.
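For instance, slicing the ROLAP "cube" on one dimension value really is just an extra WHERE clause on the relational query; a quick sketch with an illustrative sales table:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (year INTEGER, region TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?, ?)",
               [(2019, "Punjab", 500.0), (2020, "Punjab", 650.0), (2020, "Sindh", 400.0)])

# Slicing on year = 2020 in ROLAP amounts to adding a WHERE clause to the SQL statement.
slice_2020 = db.execute(
    "SELECT region, SUM(amount) FROM sales WHERE year = 2020 GROUP BY region"
).fetchall()
print(slice_2020)
```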

Q#5. Define Data Quality Management and its usage in a data warehouse, and how we can implement it.
Ans: Data quality management is a set of practices that aim at maintaining a
high quality of information. DQM goes all the way from the acquisition of data and the
implementation of advanced data processes, to an effective distribution of data. It also
requires a managerial oversight of the information you have.

The DWH Quality Management:


 Delivers end-to-end quality solutions.
 Enforces Data Quality and Data Profiling as important processes during the implementation of the data warehouse.
 Keeps a check on the metadata and its storage repository.
 Generates mappings for data correction based on business rules and ethics.

There are primarily four phases in data quality management lifecycle:

 Quality Assessment
 Quality Design
 Quality Transformation
 Quality Monitoring

In the Quality Assessment phase, the quality of the source data is determined by adopting the process of Data Profiling. Data profiling discovers and unravels irregularities, inconsistencies and redundancy occurring in the content, structure and relationships within the data. Thus, you can assess and list down the data anomalies before proceeding further.

The next phase is Quality Design, which enables business people and groups to design their quality processes. For instance, individuals can enumerate legal data and relationships within data objects complying with the data standards and rules. In this step, the managers and administrators also rectify and improve the data using data quality operators. Similarly, they can design data transformations or data mappings to ensure quality.

Next, the Quality Transformation phase runs correction mappings used for correcting the
source data.

The last phase of this cycle is Quality Monitoring, which refers to examining and investigating the data at different time intervals and receiving notifications if the data breaches any business standards or rules.

The Data Profiling process integrates with the ETL processes in the data warehouse, including the specified cleaning algorithms and other data rules and schemas. It helps users to find:

 a domain of valid product codes


 product discounts
 columns having email address patterns
 data inconsistencies and anomalies within columns
 relations between columns and tables

Such findings will enable you to manage data and data warehousing in a better way.
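A tiny profiling sketch in the spirit of those checks, run over illustrative data: it reports nulls per column and the share of values that match an email address pattern:

```python
import pandas as pd

# Illustrative extract to be profiled before it enters the warehouse.
customers = pd.DataFrame({
    "customer_id":  [1, 2, 3],
    "contact":      ["a@example.com", "b@example.com", None],
    "product_code": ["P-10", "P-11", "??"],
})

email_pattern = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"

for column in customers.columns:
    values = customers[column].dropna().astype(str)
    email_share = values.str.match(email_pattern).mean() if len(values) else 0.0
    null_count = int(customers[column].isna().sum())
    print(f"{column}: {null_count} nulls, {email_share:.0%} of values look like email addresses")
```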

Q#6. Differentiate between the following:

a) Cubes vs. Dimensional Modelling
b) Database Table vs. Fact Table of cubes
Ans:
a) Cubes vs. Dimensional Modelling:
 Dimensional Modelling:
Dimensional modelling concepts are applicable whether the model is implemented in a cube or in a relational database. The main benefit of the dimensional model is understandability: compared to the normalized model, the dimensional model is easier to understand and more intuitive. In dimensional models, information is grouped into coherent business categories or dimensions, making it easier to read and interpret.
 Cubes:
The only differences are in physical implementation. There are limitations in a cube
(multidimensional database, or MDDB) that are not an issue in a relational
implementation. Cubes are data processing units composed of fact tables and
dimensions from the data warehouse. They provide multidimensional views of data, as well as querying and analytical capabilities, to clients. A cube can be stored on a single
analysis server and then defined as a linked cube on other Analysis servers.

b) Database Table vs. Fact Table of cubes


 Database Table:
A table is a collection of related data held in a table format within a database. It
consists of columns and rows. In relational databases, and flat file databases, a table is
a set of data elements using a model of vertical columns and horizontal rows, the cell
being the unit where a row and column intersect. A table is a data structure that
organizes information into rows and columns. It can be used to both store and display
data in a structured format. For example, databases store data in tables so that
information can be quickly accessed from specific rows.
 Fact Table of cubes:
In data warehousing, a fact table consists of the measurements, metrics or facts of a business process. It is located at the center of a star schema or a snowflake schema, surrounded by dimension tables. Each record in such a fact table is uniquely defined by, for example, a day, a product and a store. A fact table represents a business theme, and facts are numerical measures. Thus, the fact table contains measures and keys to each of the related dimension tables. Dimensions, together with facts, define a data cube; facts are generally quantities, which are used for analyzing the relationships between dimensions.

Q#7. What are the different association rules of Data Mining and their algorithms? Give examples.

Ans:

Association rules are if-then statements that help to show the probability of relationships between data items within large data sets in various databases.

An association rule has two parts:

a) Antecedent (if) is an item found within the data.
b) Consequent (then) is an item found in combination with the antecedent.

Algorithms that use association rules

 AIS
 SETM
 Apriori
 AIS algorithm
In AIS, item sets are generated and counted as the algorithm scans the data. For transaction data, the AIS algorithm determines which large item sets are contained in a transaction, and new candidate item sets are created by extending those large item sets with other items from the transaction data.
 SETM algorithm
It generates candidate item sets as it scans the database, but accounts for the item sets at the end of its scan. New candidate item sets are generated in the same way as in the AIS algorithm, but the transaction ID of the generating transaction is saved with the candidate item set in a sequential structure.
 Apriori algorithm
In the Apriori algorithm, the candidate item sets are generated using only the large item sets of the previous pass. The large item sets of the previous pass are joined with themselves to generate all item sets with a size that is larger by one (see the sketch after this list).
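A minimal sketch of the Apriori idea on toy transactions: frequent item sets from one pass are joined with themselves to build candidates one item larger, which are then counted against the data. This is only an illustration of the join and counting steps, not a full implementation:

```python
from itertools import combinations

transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "beer", "bread"},
]
min_support = 2  # minimum number of transactions an item set must appear in

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Pass 1: frequent 1-item sets.
items = {item for t in transactions for item in t}
frequent_1 = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]

# Pass 2 (Apriori join step): combine frequent sets from the previous pass to build
# candidates one item larger, then keep only those meeting the support threshold.
candidates = {a | b for a, b in combinations(frequent_1, 2) if len(a | b) == 2}
frequent_2 = [c for c in candidates if support(c) >= min_support]

print(sorted(tuple(sorted(s)) for s in frequent_2))  # e.g. [('beer', 'diapers')]
```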

Examples of association rules in data mining


A classic example of association rule mining refers to a relationship between diapers and beer. The example, which seems to be fictional, claims that men who go to a store to buy diapers are also likely to buy beer. Data that would point to that might look like this: a supermarket has 200,000 customer transactions. About 4,000 transactions, or about 2% of the total, include the purchase of diapers. About 5,500 transactions (2.75%) include the purchase of beer. Of those, about 3,500 transactions (1.75%) include both the purchase of diapers and beer. If diaper and beer purchases were independent, that last number should be much lower. The fact that about 87.5% of diaper purchases also include the purchase of beer indicates a link between diapers and beer.
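The standard support, confidence and lift measures fall straight out of those figures; a quick check of the arithmetic:

```python
total_transactions  = 200_000
diaper_transactions = 4_000   # about 2% of all transactions
beer_transactions   = 5_500   # about 2.75%
both                = 3_500   # transactions containing both diapers and beer

support_both = both / total_transactions   # 0.0175 -> 1.75%
confidence   = both / diaper_transactions  # 0.875  -> 87.5% of diaper buyers also buy beer
expected_if_independent = (
    (diaper_transactions / total_transactions)
    * (beer_transactions / total_transactions)
)  # roughly 0.055% if the two purchases were unrelated
lift = support_both / expected_if_independent  # about 31.8, far above 1

print(f"support={support_both:.2%}, confidence={confidence:.1%}, lift={lift:.1f}")
```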

Q#8. What are the different tools we can use to implement a Data warehouse?
Ans: Top Pick Of 10 Data Warehouse Tools

Enlisted below are the most popular Data Warehouse tools that are available in the market.

1. Amazon Redshift
2. BigQuery
3. Panoply
4. Teradata
5. Oracle 12c
6. Informatica
7. IBM Infosphere
8. Ab Initio Software
9. ParAccel (acquired by Actian)
10. Cloudera
Q#9. Give at least four reasons why we de-normalize the database.
Ans:

1. To enhance query performance

Typically, a normalized database requires joining many tables to answer a query, and the more joins, the slower the query. As a countermeasure, you can add redundancy to the database by copying values between parent and child tables, thereby reducing the number of joins required for a query.

2. To make the database more convenient to manage

A normalized database does not store calculated values that are essential for applications. Calculating these values on the fly would require time, slowing down query execution. You can denormalize a database to supply pre-calculated values.

3. To facilitate and accelerate reporting

Often, applications need to provide a lot of analytical and statistical information. Generating reports from live data is time-consuming and may negatively impact overall system performance. Denormalizing your database can help you meet this challenge.

4. To reduce the number of foreign keys in relations

Q#10. If de-normalization improves data warehouse processes, why is the fact table in normal form?

Ans: In general, the fact table is normalized and the dimension tables are de-normalized, so that you get all the required information about a fact by joining the dimensions in a STAR schema. In some cases where dimensions are bulky, we snowflake them and make them normalized. Basically, the fact table consists of the index keys of the dimension lookup tables plus the measures; whenever a table holds only such keys and measures, that itself implies that the table is in normal form. Most operational databases use a normalized data structure. Data warehouses, in contrast, normally use a denormalized data structure: a denormalized structure uses fewer tables because it groups data and does not exclude data redundancies, and it offers better performance when reading data for analytical purposes. Factless fact tables are used for tracking a process or collecting stats. They are called so because the fact table does not have aggregatable numeric values or information. There are two types of factless fact tables: those that describe events, and those that describe conditions.
