Lahore Campus
Assignment # 5
Name: Aqsa Gulzar
Program: BSCS(7A)
Semester: 7
Step 1: Extraction
In this step, data is extracted from the source system into the staging area. Transformations, if any, are done in the staging area so that the performance of the source system is not degraded. Also, if corrupted data is copied directly from the source into the data warehouse database, rollback will be a challenge. The staging area gives an opportunity to validate the extracted data before it moves into the data warehouse.
Three Data Extraction methods:
1. Full Extraction
2. Partial Extraction - without update notification
3. Partial Extraction - with update notification
Irrespective of the method used, extraction should not affect the performance and response time of the source systems. These source systems are live production databases; any slowdown or locking could affect the company's bottom line.
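As a rough illustration, the sketch below extracts rows from a source database into a staging table in both full and incremental modes. The SQLite connections, the sales table, and the last_modified column are assumptions made for this example, not part of any specific system.

```python
import sqlite3

def full_extract(source: sqlite3.Connection, staging: sqlite3.Connection) -> None:
    """Full Extraction: copy the whole source table into the staging area."""
    rows = source.execute("SELECT id, amount, last_modified FROM sales").fetchall()
    staging.executemany(
        "INSERT INTO staging_sales (id, amount, last_modified) VALUES (?, ?, ?)", rows
    )
    staging.commit()

def incremental_extract(source: sqlite3.Connection, staging: sqlite3.Connection,
                        last_run_ts: str) -> None:
    """Partial Extraction: copy only rows changed since the previous run."""
    rows = source.execute(
        "SELECT id, amount, last_modified FROM sales WHERE last_modified > ?",
        (last_run_ts,),
    ).fetchall()
    staging.executemany(
        "INSERT INTO staging_sales (id, amount, last_modified) VALUES (?, ?, ?)", rows
    )
    staging.commit()
```

Keeping the extraction queries short and read-only helps avoid locking the live source tables for long.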
Step 2: Transformation
Data extracted from the source server is raw and not usable in its original form. Therefore, it needs to be cleansed, mapped, and transformed. In fact, this is the key step where the ETL process adds value and changes data so that insightful BI reports can be generated. In this step, you apply a set of functions to the extracted data. Data that does not require any transformation is called direct move or pass-through data. In the transformation step, you can perform customized operations on the data. For instance, the user may want a sum-of-sales revenue figure that is not in the database, or the first name and the last name in a table may be in different columns; it is possible to concatenate them before loading.
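A minimal sketch of the two transformations mentioned above: concatenating first and last names, and deriving a sum-of-sales figure that is not stored in the source. The record layout is invented for illustration.

```python
# Invented staged records used only to illustrate the transformation step.
staged_rows = [
    {"first_name": "Aqsa", "last_name": "Gulzar", "region": "Lahore", "sales": 1200.0},
    {"first_name": "Ali",  "last_name": "Khan",   "region": "Karachi", "sales": 800.0},
]

# Concatenate first and last name into one column before loading.
for row in staged_rows:
    row["full_name"] = f"{row['first_name']} {row['last_name']}"

# Derive sum-of-sales revenue per region, a value not stored in the source.
sales_by_region = {}
for row in staged_rows:
    sales_by_region[row["region"]] = sales_by_region.get(row["region"], 0.0) + row["sales"]

print(sales_by_region)  # {'Lahore': 1200.0, 'Karachi': 800.0}
```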
Step 3: Loading
Loading data into the target data warehouse database is the last step of the ETL process. In a typical data warehouse, a huge volume of data needs to be loaded in a relatively short period (often overnight). Hence, the load process should be optimized for performance.
In case of load failure, recovery mechanisms should be configured to restart from the point of failure without loss of data integrity. Data warehouse admins need to monitor, resume, or cancel loads as per prevailing server performance.
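One way to get restart-from-point-of-failure behaviour is to load in batches and record a checkpoint after each committed batch. The sketch below assumes a SQLite warehouse connection, a fact_sales table, and a local checkpoint file; all of these names are illustrative.

```python
import json
import os
import sqlite3

CHECKPOINT = "load_checkpoint.json"  # hypothetical checkpoint file

def load_in_batches(rows, warehouse: sqlite3.Connection, batch_size: int = 1000) -> None:
    """Load rows in batches; resume from the last committed batch after a failure."""
    start = 0
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            start = json.load(f)["next_row"]

    for i in range(start, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        warehouse.executemany("INSERT INTO fact_sales (id, amount) VALUES (?, ?)", batch)
        warehouse.commit()  # this batch is now durable
        with open(CHECKPOINT, "w") as f:
            json.dump({"next_row": i + len(batch)}, f)

    if os.path.exists(CHECKPOINT):
        os.remove(CHECKPOINT)  # the load finished cleanly
```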
Types of Loading:
1. Initial Load: populating all the data warehouse tables for the first time.
2. Incremental Load: applying ongoing changes periodically, as needed.
3. Full Refresh: erasing the contents of one or more tables and reloading them with fresh data.
ETL tools
Here are 8 of the best ETL software tools for 2020 and beyond:
1. Improvado
2. AWS Glue
3. Xplenty
4. Alooma
5. Talend
6. Stitch
7. Informatica PowerCenter
8. Oracle Data Integrator
Fact
Facts are the measurements or metrics from your business process. For a sales business process, a measurement would be the quarterly sales number.
Dimension
A dimension provides the context surrounding a business process event. In simple terms, dimensions give the who, what, and where of a fact.
Attributes
Attributes are the characteristics of a dimension, for example:
State
Country
Zipcode, etc.
Attributes are used to search, filter, or classify facts. Dimension tables contain attributes.
Fact Table
A fact table contains:
1. Measurements/facts
2. Foreign keys to dimension tables
Dimension Table
A dimension table contains the attributes that describe a fact and is joined to the fact table through a foreign key.
Steps to Create a Dimension Model
The accuracy of your dimensional modeling determines the success of your data warehouse implementation. The steps are as follows.
First, identify the actual business process the data warehouse should cover. This could be Marketing, Sales, HR, etc., as per the data analysis needs of the organization.
Next, define the grain. The grain describes the level of detail for the business problem/solution; it is the process of identifying the lowest level of information for any table in your data warehouse. If a table contains sales data for every day, then it has daily granularity; if a table contains total sales data for each month, then it has monthly granularity.
The step of identifying the facts is closely associated with the business users of the system, because this is where they get access to the data stored in the data warehouse. Most of the fact table rows are numerical values like price or cost per unit, etc.
In the final step, you implement the dimension model as a schema. A schema is nothing but the database structure (the arrangement of tables). There are two popular schemas:
Star Schema
The star schema architecture is easy to design. It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star consists of the fact table, and the points of the star are the dimension tables.
The fact table in a star schema is in third normal form, whereas the dimension tables are de-normalized.
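As a toy illustration of the star layout, the dictionaries below stand in for one fact table and two de-normalized dimension tables; the table and column names are invented.

```python
# Invented dimension tables (de-normalized: each holds all of its attributes).
dim_product = {
    1: {"name": "Laptop", "category": "Electronics"},
    2: {"name": "Desk", "category": "Furniture"},
}
dim_date = {
    20200101: {"day": 1, "month": 1, "year": 2020},
}

# The fact table at the center holds only measures and foreign keys.
fact_sales = [
    {"product_key": 1, "date_key": 20200101, "units": 3, "revenue": 2400.0},
    {"product_key": 2, "date_key": 20200101, "units": 1, "revenue": 150.0},
]

# A report joins each fact to its dimensions through the foreign keys.
for fact in fact_sales:
    product = dim_product[fact["product_key"]]
    date = dim_date[fact["date_key"]]
    print(date["year"], product["category"], fact["revenue"])
```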
Snowflake Schema
The snowflake schema is an extension of the star schema. In a snowflake schema, each dimension is normalized and connected to additional dimension tables.
Example
Imagine that users of our email messaging service want to access messages by category.
Keeping the name of a category right in the User_messages table can save time and reduce
the number of necessary joins.
Q#4. What is the difference between OLAP, ROLAP, MOLAP, and HOLAP?
Ans:
OLAP stands for Online Analytical Processing.
ROLAP stands for Relational Online Analytical Processing.
MOLAP stands for Multidimensional Online Analytical Processing.
HOLAP stands for Hybrid Online Analytical Processing.
ROLAP stores and processes data in relational tables, MOLAP stores pre-aggregated data in multidimensional cubes, and HOLAP combines both approaches.
Q#5. Define Data Quality Management, its usage in a data warehouse, and how we can implement it.
Ans: Data Quality Management (DQM) is a set of practices that aim at maintaining a high quality of information. DQM goes all the way from the acquisition of data and the implementation of advanced data processes to the effective distribution of data. It also requires managerial oversight of the information you have. In a data warehouse, DQM is typically implemented in four phases:
Quality Assessment
Quality Design
Quality Transformation
Quality Monitoring
In the quality assessment phase, the quality of the source data is determined by adopting the process of data profiling. Data profiling discovers and unravels irregularities, inconsistencies, and redundancy occurring in the content, structure, and relationships within the data. Thus, you can assess and list down the data anomalies before proceeding further.
The next phase is quality design, which enables business people and groups to design their quality processes. For instance, individuals can enumerate legal data values and relationships within data objects, complying with the data standards and rules. In this step, the managers and administrators also rectify and improve the data using data quality operators. Similarly, they can design data transformations or data mappings to ensure quality.
Next, the Quality Transformation phase runs correction mappings used for correcting the
source data.
The last phase of this cycle is quality monitoring, which refers to examining and investigating the data at different time intervals and receiving notifications if the data breaches any business standards or rules.
The data profiling process integrates with the ETL processes in the data warehouse, including the cleaning algorithms and the other data rules and schemas specified. It helps users find anomalies such as the ones described above, and such findings enable you to manage data and data warehousing in a better way.
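A minimal sketch of the kind of checks data profiling performs during quality assessment, counting missing values and duplicate keys; the sample records are invented.

```python
from collections import Counter

# Invented staged customer records, used only to illustrate profiling checks.
records = [
    {"id": 1, "zipcode": "54000", "country": "Pakistan"},
    {"id": 2, "zipcode": None, "country": "Pakistan"},
    {"id": 2, "zipcode": "54000", "country": "Pakistan"},  # duplicate id
]

# Count missing values per column.
missing = Counter()
for rec in records:
    for col, value in rec.items():
        if value is None:
            missing[col] += 1

# Find duplicate primary keys.
id_counts = Counter(rec["id"] for rec in records)
duplicate_ids = [key for key, n in id_counts.items() if n > 1]

print("missing values:", dict(missing))   # {'zipcode': 1}
print("duplicate ids:", duplicate_ids)    # [2]
```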
Q#7. What are the different Association rules of Data Mining and their algorithms? Give an example.
Ans:
Association rules are if-then statements that help to show the probability of relationships between data items. Three common algorithms for mining them are:
AIS
SETM
Apriori
AIS algorithm
In the AIS algorithm, itemsets are generated and counted as the data is scanned. For each transaction, the AIS algorithm determines which large itemsets are contained in that transaction, and new candidate itemsets are created by extending those large itemsets with other items from the transaction.
SETM algorithm
It generates candidate itemsets as it scans the database, but it counts the itemsets at the end of its scan. New candidate itemsets are generated in the same way as in the AIS algorithm, but the transaction ID of the generating transaction is saved with the candidate itemset in a sequential structure.
Apriori Algorithm
In the Apriori algorithm, the candidate itemsets are generated using only the large itemsets of the previous pass. The large itemsets of the previous pass are joined with themselves to generate all itemsets whose size is larger by one.
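A simplified sketch of the Apriori idea (without the candidate-pruning step): large itemsets from the previous pass are joined with themselves to form candidates one item bigger, and only candidates that meet the minimum support survive. The transactions and support threshold are invented.

```python
from itertools import chain

# Invented transactions and support threshold, for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]
MIN_SUPPORT = 2

def frequent_itemsets(transactions, min_support):
    # Pass 1: frequent single items.
    items = set(chain.from_iterable(transactions))
    large = [frozenset([i]) for i in items
             if sum(i in t for t in transactions) >= min_support]
    result = list(large)

    k = 2
    while large:
        # Join the previous pass's large itemsets with themselves
        # to form candidates that are one item larger.
        candidates = {a | b for a in large for b in large if len(a | b) == k}
        large = [c for c in candidates
                 if sum(c <= t for t in transactions) >= min_support]
        result.extend(large)
        k += 1
    return result

print(frequent_itemsets(transactions, MIN_SUPPORT))
```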
Q#8. What are the different tools we can use to implement a Data Warehouse?
Ans: Top 10 Data Warehouse Tools
Listed below are the most popular data warehouse tools available in the market.
1. Amazon Redshift
2. BigQuery
3. Panoply
4. Teradata
5. Oracle 12c
6. Informatica
7. IBM Infosphere
8. Ab Initio Software
9. ParAccel (acquired by Actian)
10. Cloudera
Q#9. Give at least four reasons why we De-Normalize the database.
Ans:
Typically, a normalized database requires joining many tables to answer queries, but the more joins, the slower the query. As a countermeasure, you can add redundancy to the database by copying values between parent and child tables, thereby reducing the number of joins required for a query.
A normalized database also does not store calculated values that are essential for applications. Calculating these values on the fly would take time and slow down query execution. You can de-normalize a database to supply pre-calculated values.
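A tiny illustration of both reasons: copying a parent value into the child record to avoid a join, and storing a pre-calculated total. The tables are invented.

```python
# Normalized form: order lines hold only a foreign key, so every report must
# join back to the products table and recompute totals.
products = {1: {"name": "Laptop", "price": 800.0}}
order_lines = [{"order_id": 10, "product_id": 1, "qty": 2}]

# De-normalized form: the product name is copied into the order line and the
# line total is pre-calculated, trading redundancy for fewer joins at query time.
order_lines_denorm = [{
    "order_id": 10,
    "product_id": 1,
    "product_name": products[1]["name"],                          # copied from parent
    "line_total": order_lines[0]["qty"] * products[1]["price"],   # pre-calculated
}]

print(order_lines_denorm[0]["product_name"], order_lines_denorm[0]["line_total"])
```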
Q#10. If de-normalization improves data warehouse processes, why is the fact table in normal form?
Ans: In general, the fact table is normalized and the dimension tables are de-normalized, so that you get all the required information about a fact by joining the dimensions in a star schema. In some cases where dimensions are bulky, we snowflake them and make them normalized.
Basically, the fact table consists of the index keys of the dimension lookup tables and the measures; when a table mostly holds such keys, that itself implies the table is in normal form. Most operational databases use a normalized data structure, whereas data warehouses normally use a de-normalized data structure. A de-normalized data structure uses fewer tables because it groups data and does not exclude data redundancies. De-normalization offers better performance when reading data for analytical purposes. Factless fact tables are used for tracking a process or collecting stats. They are called so because the fact table does not have aggregated numeric values or information. There are two types of factless fact tables: those that describe events, and those that describe conditions.