
Components or building blocks of DWH

Architecture is the proper arrangement of the elements. We build a data warehouse with software and
hardware components. To suit the requirements of our organization, we arrange these building blocks in
a certain way, and we may want to strengthen one part or another with extra tools and services. All of
this depends on our circumstances.

The figure shows the essential elements of a typical warehouse. The Source Data component appears
on the left. The Data Staging component serves as the next building block. In the middle is the
Data Storage component, which manages the data warehouse's data. This component not only stores and
manages the data; it also keeps track of the data using the metadata repository. The Information
Delivery component, shown on the right, consists of all the different ways of making the information
from the data warehouse available to the users.

Source Data Component


Source data coming into the data warehouse may be grouped into four broad categories:

Production Data: This type of data comes from the various operational systems of the enterprise. Based
on the data requirements of the data warehouse, we choose segments of the data from the various
operational systems.

Internal Data: In every organization, users keep their "private" spreadsheets, reports, customer
profiles, and sometimes even departmental databases. This is the internal data, part of which could be
useful in a data warehouse.

Archived Data: Operational systems are mainly intended to run the current business. In every
operational system, we periodically take the old data and store it in archived files.

External Data: Most executives depend on information from external sources for a large percentage of
the information they use. They use statistics relating to their industry produced by external
agencies.

Data Staging Component

After we have extracted data from various operational systems and from external sources, we
have to prepare the files for storage in the data warehouse. The extracted data, coming from
several different sources, needs to be changed, converted, and made ready in a format that is
suitable to be saved for querying and analysis.

We will now discuss the three primary functions that take place in the staging area.
1) Data Extraction: This function has to deal with numerous data sources. We have to employ the
appropriate technique for each data source.

2) Data Transformation: As we know, data for a data warehouse comes from many different sources. If
data extraction for a data warehouse poses big challenges, data transformation presents even
bigger challenges. We perform several individual tasks as part of data transformation.

First, we clean the data extracted from each source. Cleaning may involve correcting misspellings,
supplying default values for missing data elements, or eliminating duplicates when we bring in the
same data from various source systems.

Standardization of data elements forms a large part of data transformation. Data transformation
involves many forms of combining pieces of data from different sources. We combine data from a single
source record or related data elements from many source records.

On the other hand, data transformation also involves purging source data that is not useful and
separating source records into new combinations. Sorting and merging of data take place on a large
scale in the data staging area. When the data transformation function ends, we have a collection of
integrated data that is cleaned, standardized, and summarized.

3) Data Loading: Two distinct categories of tasks form the data loading function. When we complete the
design and construction of the data warehouse and go live for the first time, we do the initial loading
of the data into the data warehouse storage. The initial load moves high volumes of data and uses up a
substantial amount of time; after it, periodic incremental loads feed the ongoing changes into the
warehouse. A sketch of the whole staging flow follows.

Data Storage Components

Data storage for the data warehouse is a separate repository. The data repositories for the
operational systems generally include only the current data. Also, these repositories hold the
data in a highly normalized structure for fast and efficient processing.

Information Delivery Components


The information delivery component enables the process of subscribing to data warehouse
files and having them transferred to one or more destinations according to some
customer-specified scheduling algorithm.

Metadata Component
Metadata in a data warehouse is similar to the data dictionary or the data catalog in a database
management system. In the data dictionary, we keep data about the logical data structures, data
about the records and addresses, information about the indexes, and so on.

Data Marts

A data mart includes a subset of corporate-wide data that is of value to a specific group of users. Its
scope is confined to particular selected subjects. Data in a data warehouse should be fairly current, but
not necessarily up to the minute, although developments in the data warehouse industry have made
standard and incremental data loads more achievable. Data marts are smaller than data warehouses and
usually contain data belonging to a single department or subject area. The current trend in data
warehousing is to develop a data warehouse with several smaller related data marts for particular kinds
of queries and reports.

Management and Control Components

The management and control components coordinate the services and functions within the data
warehouse. These components control the data transformation and the data transfer into the data
warehouse storage. They also moderate the data delivery to the clients, work with the database
management systems to ensure that data is correctly saved in the repositories, and monitor the
movement of data into the staging area and from there into the data warehouse storage itself.

Why do we need a separate Data Warehouse?


Data Warehouse queries are complex because they involve the computation of large groups of data at
summarized levels.

It may require the use of distinctive data organization, access, and implementation methods based on
multidimensional views.

Performing OLAP queries against an operational database degrades the performance of operational tasks.

A data warehouse is used for analysis and decision making, which requires an extensive store of data,
including historical data, which an operational database does not typically maintain.

The separation of an operational database from data warehouses is based on the different structures
and uses of data in these systems.

Because the two systems provide different functionalities and require different kinds of data, it is
necessary to maintain separate databases.

Database VS Data Warehouse

1:

Database: It is used for Online Transaction Processing (OLTP), but can be used for other purposes such
as data warehousing. It records current transaction data from clients.

Data Warehouse: It is used for Online Analytical Processing (OLAP). It reads historical data to support
business decisions.

2:

Database: The tables and joins are complicated since they are normalized for the RDBMS. This is done to
reduce redundant data and to save storage space.

Data Warehouse: The tables and joins are simple since they are de-normalized. This is done to minimize
the response time for analytical queries.

3:

Database: Data is dynamic.

Data Warehouse: Data is largely static.

4:

Database: Entity-relationship modeling techniques are used for database design.

Data Warehouse: Data modeling techniques are used for data warehouse design.

5:

Database: Optimized for write operations.

Data Warehouse: Optimized for read operations.

6:

Database: Performance is low for analytical queries.

Data Warehouse: High performance for analytical queries.

7:

Database: The database is the place where data is stored and managed to provide fast and efficient
access.

Data Warehouse: The data warehouse is the place where application data is managed for analysis and
reporting purposes.

What is Dimensional Modeling?


Dimensional modeling represents data with a cube operation, making the logical data representation
more suitable for OLAP data management. The concept of dimensional modeling was developed by
Ralph Kimball and consists of "fact" and "dimension" tables.

In dimensional modeling, the transaction record is divided into either "facts," which are frequently
numerical transaction data, or "dimensions," which are the reference information that gives context to
the facts. For example, a sale transaction can be broken down into facts such as the number of products
ordered and the price paid for the products, and into dimensions such as order date, user name,
product number, order ship-to and bill-to locations, and the salesperson responsible for receiving the
order.

Objectives of Dimensional Modeling

The purposes of dimensional modeling are:

To produce a database architecture that is easy for end-clients to understand and write queries against.

To maximize the efficiency of queries. It achieves these goals by minimizing the number of tables and
the relationships between them.

Advantages of Dimensional Modeling


The benefits of dimensional modeling are:

Dimensional modeling is simple: Dimensional modeling methods make it possible for warehouse
designers to create database schemas that business customers can easily grasp and comprehend. There
is no need for extensive training on how to read diagrams, and there are no complicated relationships
between different data elements.

Dimensional modeling promotes data quality: The star schema enables warehouse administrators to
enforce referential integrity checks on the data warehouse. Since the fact table's key is a
concatenation of the keys of its associated dimensions, a fact record is only loaded if the
corresponding dimension records are duly defined and also exist in the database.

By enforcing foreign key constraints as a form of referential integrity check, data warehouse DBAs add a
line of defense against corrupted warehouse data.

Performance optimization is possible through aggregates: As the size of the data warehouse increases,
performance optimization develops into a pressing concern. Customers who have to wait for hours to
get a response to a query will quickly become discouraged with the warehouse. Aggregates are one of
the easiest methods by which query performance can be optimized.

Disadvantages of Dimensional Modeling


To maintain the integrity of facts and dimensions, loading the data warehouse with records from
various operational systems is complicated.

It is difficult to modify the data warehouse operation if the organization adopting the dimensional
technique changes the method in which it does business.

Elements of Dimensional Modeling


Fact

It is a collection of associated data items, consisting of measures and context data. It typically represents
business items or business transactions.

Dimensions

It is a collection of data that describes one business dimension. Dimensions determine the contextual
background for the facts, and they are the framework over which OLAP is performed.

Measure

It is a numeric attribute of a fact, representing the performance or behavior of the business relative to
the dimensions.

Considering the relational context, there are two basic models which are used in dimensional modeling:

1: Star Model

2: Snowflake Model

The star model is the underlying structure for a dimensional model. It has one broad central table (the
fact table) and a set of smaller tables (the dimensions) arranged in a radial pattern around the central
table. The snowflake model is the result of decomposing one or more of the dimensions.

Fact Table

Fact tables are used to store the facts or measures of the business. Facts are the numeric data elements
that are of interest to the company.

Characteristics of the Fact table

The fact table includes numerical values of what we measure. For example, a fact value of 20 might
mean that 20 widgets have been sold.

Each fact table includes the keys to associated dimension tables. These are known as foreign keys in the
fact table.

Fact tables typically include a small number of columns.

Compared to dimension tables, fact tables have a large number of rows.

Dimension Table

Dimension tables establish the context of the facts. Dimension tables store the fields that describe the
facts.
Characteristics of the Dimension table

Dimension tables contain the details about the facts. That, for example, enables business analysts
to understand the data and their reports better.

The dimension tables include descriptive data about the numerical values in the fact table. That is, they
contain the attributes of the facts. For example, the dimension tables for a marketing analysis function
might include attributes such as time, marketing region, and product type.

Since the records in a dimension table are denormalized, it usually has a large number of columns.
Dimension tables include significantly fewer rows of information than fact tables.

The attributes in a dimension table are used as row and column headings in a document or query results
display.

Example: A store summary in a fact table can be viewed by city and state. An item summary can be
viewed by brand, color, etc. Customer information can be viewed by name and address.

Fact Table

Time ID | Product ID | Customer ID | Units Sold
--------|------------|-------------|-----------
   4    |     17     |      2      |     1
   8    |     21     |      3      |     2
   8    |      4     |      1      |     1

In this example, the Customer ID column in the fact table is a foreign key that joins with the dimension
table. By following the links, we can see that row 2 of the fact table records the fact that customer 3,
Gaurav, bought two items on day 8.

Dimension Tables

Customer ID | Name    | Gender | Income | Education | Region
------------|---------|--------|--------|-----------|-------
     1      | Rohan   | Male   |   2    |     3     |   4
     2      | Sandeep | Male   |   3    |     5     |   1
     3      | Gaurav  | Male   |   1    |     7     |   3

Hierarchy

A hierarchy is a directed tree whose nodes are dimensional attributes and whose arcs model many-to-one
associations between dimensional attributes. It contains a dimension, positioned at the tree's
root, and all of the dimensional attributes that describe it.

STAR SCHEMA

What is Star Schema?


A star schema is the elementary form of a dimensional model, in which data is organized into facts and
dimensions. A fact is an event that is counted or measured, such as a sale or a login. A dimension
contains reference data about the fact, such as date, item, or customer.

A star schema is a relational schema whose design represents a multidimensional data model. The star
schema is the simplest data warehouse schema. It is known as a star schema because the
entity-relationship diagram of this schema resembles a star, with points diverging from a central table.
The center of the schema consists of a large fact table, and the points of the star are the dimension
tables.
Fact Tables

A fact table in a star schema contains the facts and is connected to the dimensions. A fact table has two
types of columns: those that contain facts and those that are foreign keys to the dimension tables. The
primary key of the fact table is generally a composite key that is made up of all of its foreign keys.

A fact table might contain either detail-level facts or facts that have been aggregated (fact tables that
contain aggregated facts are often called summary tables instead). A fact table generally contains facts
with the same level of aggregation.

Dimension Tables

A dimension is a structure usually composed of one or more hierarchies that categorize data. If a
dimension has no hierarchies and levels, it is called a flat dimension or list. The primary key of each
dimension table is part of the composite primary key of the fact table. Dimensional attributes
help to describe the dimensional value. They are generally descriptive, textual values. Dimension tables
are usually smaller in size than fact tables.

Fact tables store data about sales, while dimension tables store data about geographic regions (markets,
cities), clients, products, times, and channels.

Characteristics of Star Schema


The star schema is particularly suitable for data warehouse database design because of the following
features:

It creates a de-normalized database that can quickly provide query responses.

It provides a flexible design that can be changed easily or added to throughout the development cycle,
and as the database grows.

It provides a design that parallels how end-users typically think of and use the data.

It reduces the complexity of metadata for both developers and end-users.

Advantages of Star Schema

Star schemas are easy for end-users and applications to understand and navigate. With a well-
designed schema, customers can instantly analyze large, multidimensional data sets.

The main advantages of star schemas in a decision-support environment are:

Query Performance

Because a star schema database has a small number of tables and clear join paths, queries run
faster than they do against OLTP systems. Small single-table queries, frequently against a dimension
table, are almost instantaneous. Large join queries that involve multiple tables take only seconds or
minutes to run.

In a star schema database design, the dimensions are connected only through the central fact
table. When two dimension tables are used in a query, only one join path, intersecting the fact
table, exists between those two tables. This design feature enforces accurate and consistent
query results.

Load performance and administration

Structural simplicity also decreases the time required to load large batches of records into a star
schema database. By describing facts and dimensions and separating them into different
tables, the impact of a load operation is reduced. Dimension tables can be populated once and
occasionally refreshed. We can add new facts regularly and selectively by appending records to
the fact table.

Built-in referential integrity

A star schema has referential integrity built in when data is loaded. Referential integrity
is enforced because each record in a dimension table has a unique primary key, and all keys in
the fact table are legitimate foreign keys drawn from the dimension tables. A record in the fact
table which is not related correctly to a dimension cannot be given the correct key value to be
retrieved.

Easily Understood
A star schema is simple to understand and navigate, with dimensions joined only through the
fact table. These joins are more meaningful to the end-user because they represent the
fundamental relationships between parts of the underlying business. Customers can also browse
dimension table attributes before constructing a query.

Disadvantage of Star Schema

There are some conditions that cannot be met by star schemas. For example, the relationship between
a user and a bank account cannot be described as a star schema, because the relationship between
them is many-to-many.

Example: Suppose a star schema is composed of a fact table, SALES, and several dimension
tables connected to it for time, branch, item, and geographic locations.

The TIME table has columns for day, month, quarter, and year. The ITEM table has
columns for item_key, item_name, brand, type, and supplier_type. The BRANCH table has
columns for branch_key, branch_name, and branch_type. The LOCATION table has columns of
geographic data, including street, city, state, and country.

In this scenario, the SALES table contains only four columns with IDs from the dimension tables TIME,
ITEM, BRANCH, and LOCATION, instead of four columns for time data, four columns for ITEM data, three
columns for BRANCH data, and four columns for LOCATION data. Thus, the size of the fact table is
significantly reduced. When we need to change an item, we need only make a single change in the
dimension table, instead of making many changes in the fact table.

We can create even more complex star schemas by normalizing a dimension table into several tables.
The normalized dimension table is called a Snowflake.
What is Snowflake Schema?
A snowflake schema is a variant of the star schema. "A schema is known as a snowflake if one or more
dimension tables do not connect directly to the fact table but must join through other dimension
tables."

The snowflake schema is an expansion of the star schema in which each point of the star explodes into
more points. It is called a snowflake schema because the diagram of the schema resembles a
snowflake. Snowflaking is a method of normalizing the dimension tables in a star schema. When we
normalize all the dimension tables entirely, the resultant structure resembles a snowflake with the fact
table in the middle.

Snowflaking is used to improve the performance of specific queries. The schema is diagrammed with each
fact surrounded by its associated dimensions, and those dimensions are related to other dimensions,
branching out into a snowflake pattern.

The snowflake schema consists of one fact table that is linked to many dimension tables, which can be
linked to other dimension tables through a many-to-one relationship. Tables in a snowflake schema are
generally normalized to third normal form. Each dimension table represents exactly one level in a
hierarchy.

The following diagram shows a snowflake schema with two dimensions, each having three levels. A
snowflake schema can have any number of dimensions, and each dimension can have any number of
levels.

Example: The figure shows a snowflake schema with a Sales fact table and Store, Location, Time, Product,
Line, and Family dimension tables. The Market dimension has two dimension tables, with Store as the
primary dimension table and Location as the outrigger dimension table. The Product dimension has
three dimension tables, with Product as the primary dimension table, and the Line and Family tables as
the outrigger dimension tables.
A star schema stores all attributes for a dimension in one denormalized table. This needs more disk
space than a more normalized snowflake schema. Snowflaking normalizes the dimension by moving
attributes with low cardinality into separate dimension tables that relate to the core dimension table by
using foreign keys. Snowflaking for the sole purpose of minimizing disk space is not recommended,
because it can adversely impact query performance.

In a snowflake schema, tables are normalized to remove redundancy. In a snowflake schema, dimension
tables are broken down into multiple dimension tables.

The figure shows a simple star schema for sales in a manufacturing company. The sales fact table includes
quantity, price, and other relevant metrics. SALESREP, CUSTOMER, PRODUCT, and TIME are the
dimension tables.
NOTE:

A snowflake schema is designed for flexible querying across more complex dimensions and relationships.
It is suitable for many-to-many and one-to-many relationships between dimension levels.

Advantage of Snowflake Schema


The primary advantage of the snowflake schema is the improvement in query performance due to
minimized disk storage requirements and joins against smaller lookup tables.

It provides greater scalability in the interrelationship between dimension levels and components.

There is no redundancy, so it is easier to maintain.

Disadvantage of Snowflake Schema


The primary disadvantage of the snowflake schema is the additional maintenance effort required due
to the increased number of lookup tables. It is also known as a multi-fact star schema.

Queries are more complex and hence more difficult to understand.

More tables mean more joins, and so longer query execution time.


DWH
What is a Data Warehouse?
A Data Warehouse (DW) is a relational database that is designed for query and analysis rather than
transaction processing. It includes historical data derived from transaction data from single and multiple
sources.

A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on providing
support for decision-makers for data modeling and analysis.

A Data Warehouse is a group of data specific to the entire organization, not only to a particular group of
users.

It is not used for daily operations and transaction processing but used for making decisions.

ATTRIBUTES:

A Data Warehouse can be viewed as a data system with the following attributes:

It is a database designed for investigative tasks, using data from various applications.

It supports a relatively small number of clients with relatively long interactions.

It includes current and historical data to provide a historical perspective of information.

Its usage is read-intensive.

It contains a few large tables.

"Data Warehouse is a subject-oriented, integrated, and time-variant store of information in support of


management's decisions."
Subject-Oriented

A data warehouse targets the modeling and analysis of data for decision-makers. Therefore, data
warehouses typically provide a concise and straightforward view of a particular subject, such as
customer, product, or sales, instead of the organization's ongoing global operations. This is done by
excluding data that is not useful for the subject and including all data needed by the users to
understand the subject.

Integrated

A data warehouse integrates various heterogeneous data sources, such as RDBMSs, flat files, and online
transaction records. Data cleaning and integration are performed during data warehousing to
ensure consistency in naming conventions, attribute types, etc., among the different data sources.

Time-Variant

Historical information is kept in a data warehouse. For example, one can retrieve data from 3 months, 6
months, 12 months, or even earlier from a data warehouse. This contrasts with a transaction
system, where often only the most current data is kept.

Non-Volatile
The data warehouse is a physically separate data store, into which data is transformed from the source
operational RDBMS. Operational updates of data do not occur in the data warehouse, i.e., update,
insert, and delete operations are not performed. Data access usually requires only two procedures:
the initial loading of data and read access to the data. Therefore, the data warehouse does not require
transaction processing, recovery, or concurrency-control capabilities, which allows for a substantial
speedup of data retrieval. Non-volatile means that, once entered into the warehouse, data should not
change.

Need for Data Warehouse


Data Warehouse is needed for the following reasons:

1) Business User: Business users require a data warehouse to view summarized data from the past.
Since these people are non-technical, the data may be presented to them in an elementary form.

2) Store historical data: A data warehouse is required to store time-variant data from the past. This
input is kept to be used for various purposes.

3) Make strategic decisions: Some strategies may depend upon the data in the data warehouse.
So, the data warehouse contributes to making strategic decisions.

4) For data consistency and quality: By bringing data from different sources to a common place, the
user can effectively bring uniformity and consistency to the data.

5) High response time: The data warehouse has to be ready for somewhat unexpected loads and types of
queries, which demands a significant degree of flexibility and quick response time.

Benefits of Data Warehouse:


Understand business trends and make better forecasting decisions.

Data warehouses are designed to perform well with enormous amounts of data.


The structure of data warehouses is more accessible for end-users to navigate, understand, and query.

Queries that would be complex against many normalized databases can be easier to build and maintain in
a data warehouse.

Data warehousing is an efficient method to manage demand for lots of information from lots of users.

Data warehousing provides the capability to analyze large amounts of historical data.

Evolution / History of Data Warehouse:


A Data Warehouse (DW) stores corporate information and data from operational systems and a
wide range of other data resources. Data Warehouses are designed to support the decision-
making process through data collection, consolidation, analytics, and research. They can be
used in analyzing a specific subject area, such as “sales,” and are an important part of modern
Business Intelligence. The architecture for Data Warehouses was developed in the 1980s to
assist in transforming data from operational systems to decision-making support systems.
Normally, a Data Warehouse is part of a business’s mainframe server or in the Cloud.

In a Data Warehouse, data from many different sources is brought to a single location and then
translated into a format the Data Warehouse can process and store. For example, a business
stores data about its customers, products, employees and their salaries, sales, and
invoices. The boss may ask about the latest cost-reduction measures, and getting answers will
require an analysis of all of the previously mentioned data. Unlike basic operational data
storage, Data Warehouses contain aggregated historical data (highly useful data taken from a
variety of sources).

Punch cards were the first solution for storing computer generated data. By the 1950s, punch
cards were an important part of the American government and businesses. The warning “Do
not fold, spindle, or mutilate” originally came from punch cards. Punch cards continued to be
used regularly until the mid-1980s. They are still used to record the results of voting ballots and
standardized tests.

“Magnetic storage” slowly replaced punch cards starting in the 1960s. Disk storage came as the
next evolutionary step for data storage. Disk storage (hard drives and floppies) started
becoming popular in 1964 and allowed data to be accessed directly, which was a significant
improvement over the clumsier magnetic tapes.

IBM was primarily responsible for the early evolution of disk storage. They invented the floppy
disk drive as well as the hard disk drive. They are also credited with several of the
improvements now supporting their products. IBM began developing and manufacturing disk
storage devices in 1956. In 2003, they sold their “hard disk” business to Hitachi.

Database Management Systems


Disk storage was quickly followed by software called a Database Management System (DBMS).
In 1966, IBM came up with its own DBMS called, at the time, an Information Management
System. DBMS software was designed to manage “the storage on the disk” and included the
following abilities:

Identify the proper location of data

Resolve conflicts when more than one unit of data is mapped to the same location

Allow data to be deleted

Find room when stored data won’t fit in a specific, limited physical location

Find data quickly (which was the greatest benefit)

Online Applications

In the late 1960s and early ‘70s, commercial online applications came into play, shortly after
disk storage and DBMS software became popular. Once it was realized data could be accessed
directly, information began being shared between computers. As a result, there were a large
number of commercial applications which could be applied to online processing. Some
examples included:

Claims processing

Bank teller processing

Automated teller processing (ATMs)

Airline reservation processing

Retail point of sale processing

Manufacturing control processing

In spite of these improvements, finding specific data could be difficult, and it was not
necessarily trustworthy. The data found might be based on "old" information. At this time, so
much data was being generated by corporations that people couldn't trust the accuracy of the data
they were using.

Personal Computers and 4GL Technology

In response to this confusion and lack of trust, personal computers became viable solutions.

Personal computer technology let anyone bring their own computer to work and do processing
when convenient. This led to personal computer software, and the realization that the personal
computer’s owner could store their “personal” data on their computer. With this change in
work culture, it was thought a centralized IT department might no longer be needed.
Simultaneously, a technology called 4GL was developed and promoted. 4GL technology
(developed in the 1970s through 1990) was based on the idea that programming and system
development should be straightforward and anyone should be able to do it. This new
technology also prompted the disintegration of centralized IT departments.

4GL technology and personal computers had the effect of freeing the end user, allowing them
to take much more control of the computer system and find information quickly and efficiently.
The goal of freeing end users and allowing them to access their own data was a very popular
step forward. Personal computers and 4GL quickly gained popularity in the corporate
environment. But along the way, something unexpected happened. End users discovered that:

Incorrect data can be misleading.

Incomplete data may not be very useful.

Old data is not desirable.

Multiple versions of the same data can be confusing.

Data lacking documentation is questionable.

Relational Databases

Relational databases became popular in the 1980s. Relational databases were significantly
more user-friendly than their predecessors. Structured Query Language (SQL) is the language
used by relational database management systems (RDBMS). By the late 1980s, a large number
of businesses had moved from mainframe computers to client-server systems. Staff members were
now assigned a personal computer, and office applications (Excel, Microsoft Word, and Access)
started gaining favor.

The Need for Data Warehouses

During the 1990s major cultural and technological changes were taking place. The internet was
surging in popularity. Competition had increased due to new free trade agreements,
computerization, globalization, and networking. This new reality required greater business
intelligence, resulting in the need for true data warehousing. During this time, the use of
application systems exploded.

By the year 2000, many businesses discovered that, with the expansion of databases and
application systems, their systems had been badly integrated and that their data was
inconsistent. They discovered they were receiving and storing lots of fragmented data.
Somehow, the data needed to be integrated to provide the critical “Business Information”
needed for decision-making in a competitive, constantly-changing global economy.

Data Warehouses were developed by businesses to consolidate the data they were taking from
a variety of databases, and to help support their strategic decision-making efforts.

The Use of NoSQL


As Data Warehouses came into being, an accumulation of Big Data began to develop. This
accumulation was fed by computers, smartphones, the Internet, and the Internet of Things, which
all provide data. Credit cards have also played a role, as has social media.

Facebook began using a NoSQL system in 2008. NoSQL is a "non-relational" Database Management
System that uses fairly simple architecture. It is quite useful when processing Big Data.

Data Warehouse Alternatives

Data Silos can be a natural occurrence in large organizations, with each department having
different goals, responsibilities, and priorities. Data silos are storage areas of fixed data which
are under the control of a single department and have been separated and isolated from access
by other departments for privacy and security. Data silos can also happen when departments
compete instead of working together towards common goals. They are generally considered a
hindrance to collaboration and efficient business practices.

A Data Mart is an area for storing data that serves a particular community or group of workers.
They are storage areas with fixed data and deliberately under the control of one department
within the organization.

Data Swamps can be the result of a poorly designed or neglected Data Lake. A Data Swamp
describes a failure to document stored data correctly. This situation makes the data difficult
to analyze and use efficiently. While the original data may still be there, a Data Swamp cannot
recover it without the appropriate metadata for context.

A Data Cube is software that stores data in matrices of three or more dimensions. Any
transformations in the data are expressed as tables and arrays of processed information. After
tables have matched the rows of data strings with the columns of data types, the data cube
then cross-references tables from a single data source or multiple data sources, increasing the
detail of each data point. This arrangement provides researchers with the ability to find deeper
insights than other techniques.

Data Lakes use a more flexible structure for data on the way in than a Data Warehouse does. Rather
than organizing incoming data to fit a fixed warehouse schema, they use a more fluid approach to
storing it, adding structure to the data only as it moves to the application layer. Data Lakes preserve
the original structure of data and can be used as a storage and retrieval system for Big Data,
which could, theoretically, scale upward indefinitely.

OPERATIONAL DATABASE VS DATA WAREHOUSE

1:

Operational systems are designed to support high-volume transaction processing.

Data warehousing systems are typically designed to support high-volume analytical processing
(i.e., OLAP).

2:

Operational systems are usually concerned with current data.

Data warehousing systems are usually concerned with historical data.

3:

Data within operational systems is updated regularly according to need.

Data in a data warehouse is non-volatile; new data may be added regularly, but once added it is
rarely changed.

4:

Operational systems are designed for real-time business dealings and processes.

Data warehousing systems are designed for analysis of business measures by subject area,
categories, and attributes.

5:

Operational systems are optimized for a simple set of transactions, generally adding or retrieving
a single row at a time per table.

Data warehousing systems are optimized for bulk loads and large, complex, unpredictable queries
that access many rows per table.

6:

Operational systems are optimized for validation of incoming data during transactions and use
validation data tables.

Data warehousing systems are loaded with consistent, valid data and require no real-time
validation.

Approaches
A data warehouse is a heterogeneous collection of different data sources organized under a unified
schema. There are two approaches for constructing a data warehouse: the top-down approach and the
bottom-up approach, both explained below.
External Sources –

An external source is a source from which data is collected, irrespective of the type of data. Data can
be structured, semi-structured, or unstructured.

Staging Area –

Since the data extracted from the external sources does not follow a particular format, it needs to be
validated before being loaded into the data warehouse. For this purpose, it is recommended to use an
ETL tool.

E (Extract): Data is extracted from the external data sources.

T (Transform): Data is transformed into the standard format.

L (Load): Data is loaded into the data warehouse after being transformed into the standard format.

Data Warehouse –

After cleansing, the data is stored in the data warehouse as the central repository. It actually stores the
metadata, while the actual data is stored in the data marts. Note that the data warehouse stores the data
in its purest form in this top-down approach.

Data Marts –

A data mart is also a part of the storage component. It stores the information of a particular function of
an organization, handled by a single authority. There can be as many data marts in an organization as
there are functions. We can also say that a data mart contains a subset of the data stored in the data
warehouse.

Data Mining –

The practice of analyzing the big data present in a data warehouse is data mining. It is used to find the
hidden patterns present in the database or the data warehouse with the help of data mining
algorithms.
This approach is defined by Inmon: the data warehouse is built as a central repository for the complete
organization, and data marts are created from it after the complete data warehouse has been created.

Advantages of Top-Down Approach –

Since the data marts are created from the data warehouse, this provides a consistent dimensional view
of the data marts.

Also, this model is considered the strongest model for business changes. That is why big organizations
prefer to follow this approach.

Creating a data mart from the data warehouse is easy.

Disadvantages of Top-Down Approach –

The cost and time taken in designing and maintaining it are very high.

Bottom-Up Approach –

1: First, the data is extracted from external sources (the same as in the top-down approach).

2: Then, the data goes through the staging area (as explained above) and is loaded into data marts
instead of the data warehouse. The data marts are created first and provide reporting capability. Each
data mart addresses a single business area.

3: These data marts are then integrated into the data warehouse.

This approach is given by Kimball: data marts are created first and provide a thin view for analysis,
and the data warehouse is created after the complete data marts have been created.

Advantages of Bottom-Up Approach –

As the data marts are created first, reports are generated quickly.

We can accommodate more data marts here, and in this way the data warehouse can be
extended.

Also, the cost and time taken in designing this model are comparatively low.

Disadvantage of Bottom-Up Approach –

This model is not as strong as the top-down approach, as the dimensional view of the data marts is not
as consistent as it is in the approach above.
