
What Is a Data Warehouse

A data warehouse is a centralized storage system that allows for the storing, analyzing, and interpreting
of data in order to facilitate better decision-making. Transactional systems, relational databases, and
other sources feed data into data warehouses on a regular basis.

A data warehouse is a type of data management system that facilitates and supports business
intelligence (BI) activities, specifically analysis. Data warehouses are primarily designed to facilitate
searches and analyses and usually contain large amounts of historical data.

A data warehouse can be defined as a collection of organizational data and information extracted from
operational sources and external data sources. The data is periodically pulled from various internal
applications like sales, marketing, and finance; customer-interface applications; as well as external
partner systems. This data is then made available for decision-makers to access and analyze. 

So what is a data warehouse? For a start, it is a comprehensive repository of current and historical
information that is designed to enhance an organization's performance.

Key Characteristics of Data Warehouse

The main characteristics of a data warehouse are as follows:

 Subject-Oriented

A data warehouse is subject-oriented since it provides topic-wise information rather than the overall
processes of a business. Such subjects may be sales, promotion, inventory, etc. For example, if you want
to analyze your company’s sales data, you need to build a data warehouse that concentrates on sales.
Such a warehouse would provide valuable information like ‘who was your best customer last year?’ or
‘who is likely to be your best customer in the coming year?’

 Integrated

A data warehouse is developed by integrating data from varied sources into a consistent format. The data
must be stored in the warehouse in a consistent and universally acceptable manner in terms of naming,
format, and coding. This facilitates effective data analysis. 

 Non-Volatile

Data once entered into a data warehouse must remain unchanged. All data is read-only. Previous data is
not erased when current data is entered. This helps you to analyze what has happened and when. 

 Time-Variant
The data stored in a data warehouse is documented with an element of time, either explicitly or implicitly.
An example of time variance in Data Warehouse is exhibited in the Primary Key, which must have an
element of time like the day, week, or month.


Database vs. Data Warehouse

Although a data warehouse and a traditional database share some similarities, they are not the same
thing. The main difference is that in a database, data is collected for multiple transactional purposes.
In a data warehouse, however, data is collected on an extensive scale to perform analytics. Databases
provide real-time data, while warehouses store data to be accessed for big analytical queries.

A data warehouse is an example of an OLAP (online analytical processing) system, an online database
query-answering system. OLTP (online transaction processing) is an online database-modifying system;
an ATM is a typical example.

What is Structured Data?

Structured data is information that has been formatted and transformed into a well-defined
data model. The raw data is mapped into predesigned fields that can then be extracted and
read through SQL easily. SQL relational databases, consisting of tables with rows and columns,
are the perfect example of structured data.

The relational model of this data format uses storage efficiently, since it minimizes data redundancy.
However, this also means that structured data is more interdependent and less flexible. Now
let's look at more examples of structured data.

Examples of Structured Data

This type of data is generated by both humans and machines. There are numerous examples
of structured data generated by machines, such as POS data like quantity, barcodes, and
weblog statistics. Similarly, anyone who works on data has used spreadsheets at some point,
which is a classic case of structured data generated by humans. Due to its organization,
structured data is easier to analyze than both semi-structured and unstructured data.
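A minimal sketch of structured data may help: POS-style records mapped into a predesigned relational table that can then be read back with SQL. The table and column names here are hypothetical, not from any particular system.

```python
import sqlite3

# Structured data: every value sits in a predesigned field of a
# relational table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pos_sales (
        barcode TEXT,
        product TEXT,
        quantity INTEGER,
        unit_price REAL
    )
""")
conn.executemany(
    "INSERT INTO pos_sales VALUES (?, ?, ?, ?)",
    [
        ("0123456789012", "Marker", 3, 1.50),
        ("0123456789029", "Grammar book", 1, 12.00),
    ],
)

# Because the fields are well defined, the data is easily extracted
# and read through SQL.
rows = conn.execute(
    "SELECT product, quantity * unit_price FROM pos_sales ORDER BY product"
).fetchall()
```

Because the model is rigid, any row that does not fit these four fields would be rejected, which is exactly the inflexibility the paragraph above describes.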

What is Semi-Structured Data?

Your data sets may not always be structured or unstructured; semi-structured (or partially
structured) data is a category between the two. Semi-structured data has some consistent and
definite characteristics, but it does not conform to the rigid structure required by relational
databases. Organizational properties like metadata or semantic tags are used with semi-structured
data to make it more manageable; however, it still contains some variability and inconsistency.

Examples of Semi-Structured Data

An example of a semi-structured data format is a delimited file, which contains elements that can
break the data down into separate hierarchies. Similarly, a digital photograph has no pre-defined
structure itself but carries certain structural attributes that make it semi-structured. For
instance, an image taken from a smartphone has structured attributes like a geotag, device ID,
and date-time stamp. After being stored, images can also be assigned tags such as 'pet' or 'dog'
to provide structure.

On some occasions, unstructured data is classified as semi-structured data because it has one or
more classifying attributes.
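The photo example above can be sketched as JSON, a common semi-structured format: the records share some consistent attributes but do not all follow one rigid schema. The field names and values here are hypothetical.

```python
import json

# Semi-structured records: the second photo lacks a geotag, the first
# lacks user tags, yet both remain manageable via their metadata keys.
photos = json.loads("""
[
  {"device_id": "phone-01", "taken_at": "2020-09-23T05:21:01",
   "geotag": {"lat": 24.86, "lon": 67.00}},
  {"device_id": "cam-07", "taken_at": "2020-09-24T11:02:45",
   "tags": ["pet", "dog"]}
]
""")

# Organizational properties still let us query across the variability.
with_geotag = [p["device_id"] for p in photos if "geotag" in p]
```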

What is Unstructured Data?

Unstructured data is data in its absolute raw form. It is difficult to process due to its complex
arrangement and formatting. Unstructured data management takes data in many forms, including
social media posts, chats, satellite imagery, IoT sensor data, emails, and presentations, and
organizes it in a logical, predefined manner in data storage. In contrast, structured data follows
predefined data models and is easy to analyze; examples include alphabetically arranged customer
names and properly organized credit card numbers. Having defined unstructured data, let's look at
some examples.

Examples of Unstructured Data

Unstructured data can be anything that is not in a specific format. This can be a paragraph from
a book with relevant information or a web page. Social media comments and posts, which need
analysis before they yield information, are another example. So are log files that are not easy
to separate, such as the raw entry below:
38,P-R-38636-6-45,P-R-39105-1-11,P-R-38036-1-5,P-R-35697-1-13,P-R-35087-1-
27,P-R-34341-1-9,P-R-33341-1-15,P-R-33110-1-29,P-R-31345-1-693,P-R-29076-1-
6,P-R-28767-1-8,P-R-28540-2-8,P-R-28312-1-10,P-R-28069-1-27,P-R-28032-1-9,P-R-
26562-1-12,P-R-26527-5-20,P-R-26164-1-11,P-R-25785-1-30,P-R-25095-9-70,P-R-
23504-1-15,P-R-19719-5-41203
Wed Sep 23 2020 05:21:01 GMT+0500

Goals of a Data Warehouse


Make an organization's information easily accessible 
The contents of the data warehouse must be understandable and be intuitive and obvious to
the business user. The contents of the data warehouse need to be labeled meaningfully. The
tools that access the data warehouse must be simple and easy to use. They also must
return query results to the user with minimal wait times. 
Present the organization's information consistently
Consistent information means high-quality information. It means that all the data is
accounted for and complete. Consistency also implies that common definitions for the
contents of the data warehouse are available for users. 
Be adaptive and resilient to change
We simply can't avoid change. User needs, business conditions, data, and technology are all
subject to the shifting sands of time. The data warehouse must be designed to handle this
inevitable change.
 
Be a secure bastion that protects our information assets
The data warehouse must effectively control access to the organization's confidential
information.
Serve as the foundation for improved decision-making
The data warehouse must have the right data in it to support decision-making.


Three common data warehouse architectures are:

o Data Warehouse Architecture: Basic
o Data Warehouse Architecture: With Staging Area
o Data Warehouse Architecture: With Staging Area and Data Marts

Data Warehouse Architecture: Basic


Operational System

In data warehousing, an operational system refers to a system that processes the day-to-day
transactions of an organization.

Flat Files

A Flat file system is a system of files in which transactional data is stored, and every file in the system
must have a different name.

Meta Data

A set of data that defines and gives information about other data.

Metadata summarizes necessary information about data, which can make finding and working with
particular instances of data easier. For example, author, date created, date modified, and file
size are examples of very basic document metadata.

Metadata is used to direct a query to the most appropriate data source.

Lightly and highly summarized data

This area of the data warehouse stores all the predefined lightly and highly summarized
(aggregated) data generated by the warehouse manager.
The goal of summarized information is to speed up query performance. The summarized records are
updated continuously as new data is loaded into the warehouse.

What Is ETL Process In Data Warehouse?


A data warehouse is a collection of huge volumes of data that provides information to business
users with the help of Business Intelligence tools.

To serve this purpose, the DW should be loaded at regular intervals. The data in the system is
gathered from one or more operational systems, flat files, etc. The process that brings the data
to the DW is known as the ETL process. Extraction, Transformation, and Loading are the tasks of ETL.

#1) Extraction: All the preferred data from various source systems such as databases, applications, and flat
files is identified and extracted. Data extraction can be completed by running jobs during non-business hours.

#2) Transformation: Most of the extracted data can’t be directly loaded into the target system. Based on the
business rules, some transformations can be done before loading the data.

For example, a target column may expect two source columns concatenated as input. Likewise, there
may be complex logic for data transformation that needs expertise. Some data that does not need
any transformation can be moved directly to the target system.
The transformation process also corrects the data, removes incorrect data, and fixes errors in the
data before loading it.

#3) Loading: All the gathered information is loaded into the target Data Warehouse tables.
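The three ETL tasks above can be sketched end to end. This is a minimal illustration with toy data; all names, rows, and business rules are hypothetical, not a real ETL tool's API.

```python
# A minimal end-to-end sketch of Extraction, Transformation, and Loading.

def extract():
    # Extraction: pull the preferred rows from a source system.
    return [
        {"first": "Ada", "last": "Lovelace", "amount": "100"},
        {"first": "Alan", "last": "Turing", "amount": "oops"},  # bad row
    ]

def transform(rows):
    # Transformation: concatenate two source columns into one target
    # column and reject rows that fail the business rules.
    good, rejected = [], []
    for row in rows:
        try:
            good.append({"name": f"{row['first']} {row['last']}",
                         "amount": int(row["amount"])})
        except ValueError:
            rejected.append(row)  # would go to a reject file/table
    return good, rejected

def load(rows, target):
    # Loading: write the cleaned rows into the target DW table.
    target.extend(rows)

warehouse_table = []
good, rejected = transform(extract())
load(good, warehouse_table)
```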

Extraction Flow Diagram:


Agree with each source system, in advance, on the time window for running the extraction jobs, so
that no source data is missed during the extraction cycle.

With the above steps, extraction achieves the goal of converting data from different formats and
different sources into a single DW format, which benefits the whole ETL process. Such logically
organized data is more useful for analysis.

Extraction Methods In Data Warehouse


Depending on the source and target data environments and the business needs, you can select the extraction
method suitable for your DW.

#1) Logical Extraction Methods


Data extraction in a data warehouse system can be a one-time full load that is done initially, or
it can be incremental loads that occur on every run with the latest updates.

 Full Extraction: As the name suggests, the source system data is completely extracted to the
target table. Each run of this kind of extraction loads the entire current source system data,
without considering the last-extracted timestamps. Preferably, use full extraction for initial
loads or for tables with little data.
 Incremental Extraction: Only the data added or modified after a specific date is considered for
incremental extraction. This date is business-specific, such as the last extracted date or the
last order date. You can refer to a timestamp column in the source table itself, or a separate
table can be created to track only the extraction date details. Referring to a timestamp is the
reliable method during incremental extraction; logic without timestamps may fail if the DW table
holds large volumes of data.
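The two logical extraction methods can be contrasted in a short sketch. The source rows and their last-modified timestamps are hypothetical toy data.

```python
from datetime import datetime

# Toy source table whose rows carry a last-modified timestamp.
source_table = [
    {"id": 1, "modified": datetime(2007, 6, 3)},
    {"id": 2, "modified": datetime(2007, 6, 4)},
    {"id": 3, "modified": datetime(2007, 6, 5)},
]

def full_extraction(table):
    # Loads the entire current source data, ignoring prior timestamps.
    return list(table)

def incremental_extraction(table, last_extracted):
    # Considers only rows added/modified after the last extracted date.
    return [row for row in table if row["modified"] > last_extracted]

initial = full_extraction(source_table)
delta = incremental_extraction(source_table, datetime(2007, 6, 4))
```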
#2) Physical Extraction Methods
Depending on the source systems' capabilities and the limitations of the data, the source systems
can provide the data physically for extraction as online extraction or offline extraction. These
support either of the logical extraction types.

 Online Extraction: We connect directly to the source system databases with connection strings
and extract data directly from the source system tables.
 Offline Extraction: We do not connect directly to the source system database; instead, the
source system provides data explicitly in a pre-defined structure. Source systems can provide data
in the form of flat files, dump files, archive logs, and tablespaces.

ETL tools, though expensive, are best suited to performing complex data extractions any number of
times for the DW.

Extracting Changed Data


Once the initial load is completed, it is important to consider how to extract the data that
subsequently changes in the source system. The ETL process team should design a plan for how to
implement extraction for the initial loads and the incremental loads at the beginning of the
project itself.

Mostly you can use the "audit columns" strategy for the incremental load to capture data changes.
In general, the source system tables may contain audit columns that store the timestamp for each
insertion or modification.

The timestamp may be populated by database triggers or by the application itself. You must ensure
the accuracy of the audit columns' data, however they are populated, so as not to miss changed
data during incremental loads.

During the incremental load, you can consider the maximum date and time of when the last load has happened
and extract all the data from the source system with the time stamp greater than the last load time stamp.

While Extracting the Data:

 Use queries optimally to retrieve only the data that you need.
 Do not overuse the DISTINCT clause, as it slows down query performance.
 Use SET operators such as UNION, MINUS, and INTERSECT carefully, as they degrade performance.
 Use comparison keywords such as LIKE and BETWEEN in the WHERE clause, rather than functions such
as substr() or to_char().
Data Transformation
Transformation is the process where a set of rules is applied to the extracted data before directly loading the
source system data to the target system. The extracted data is considered as raw data.

The transformation process with a set of standards brings all dissimilar data from various source systems into
usable data in the DW system. Data transformation aims at the quality of the data. You can refer to the data
mapping document for all the logical transformation rules.

Based on the transformation rules, if any source data does not meet the instructions, it is
rejected before loading into the target DW system and is placed into a reject file or reject table.

Transformation rules are not specified for straight-load columns (data that does not need any
change) from source to target. Hence, data transformations can be classified as simple or complex.
Data transformations may involve column conversions, data structure reformatting, etc.
Given below are some of the tasks to be performed during Data Transformation:
#1) Selection: You can select either the entire table data or a specific set of columns from the
source systems. The selection of data is usually completed during extraction itself.
There may be cases where the source system does not allow selecting a specific set of columns
during the extraction phase; in that case, extract the whole data and do the selection in the
transformation phase.

#2) Splitting/joining: You can manipulate the selected data by splitting or joining it. Often the
selected source data needs to be split even further during the transformation.
For example, if the whole address is stored in a single large text field in the source system, the
DW system may require the address to be split into separate fields for city, state, zip code, etc.
This makes indexing and analysis based on each individual component easy.
Joining/merging two or more columns of data is also widely used during the transformation phase in
the DW system. This does not simply mean merging two fields into a single field.

For example, if information about a particular entity comes from multiple data sources, then
gathering that information into a single entity can be called joining/merging the data.
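The splitting step above can be sketched as follows. The address layout (street, city, state + zip) is an assumed convention for illustration only; real addresses need far more robust parsing.

```python
# Splitting during transformation: one free-text source field becomes
# separate city/state/zip columns in the DW, so each component can be
# indexed and analyzed individually (assumed, hypothetical layout).
def split_address(address):
    street, city, state_zip = [part.strip() for part in address.split(",")]
    state, zip_code = state_zip.split()
    return {"street": street, "city": city,
            "state": state, "zip": zip_code}

record = split_address("12 Elm St, Springfield, IL 62704")
```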
#3) Conversion: The extracted source systems data could be in different formats for each data type, hence all
the extracted data should be converted into a standardized format during the transformation phase. The same
kind of format is easy to understand and easy to use for business decisions.
#4) Summarization: In some situations, the DW needs summarized data rather than the low-level
detailed data of the source systems, because low-level data is not best suited for analysis and
querying by business users.
For example, sales data for every checkout may not be required by the DW system; daily sales by
product or daily sales by store is useful. Hence, summarization of data can be performed during
the transformation phase as per the business requirements.
#5) Enrichment: When a DW column is formed by combining one or more columns from multiple records,
data enrichment re-arranges the fields for a better view of the data in the DW system.
#6) Format revisions: Format revisions happen most frequently during the transformation phase. The data
type and its length are revised for each column.
For example, a column in one source system may be numeric and the same column in another source system
may be a text. To standardize this, during the transformation phase the data type for this column is changed to
text.
#7) Decoding of fields: When you are extracting data from multiple source systems, the data may be
encoded differently in each system.
For example, one source system may represent customer status as AC, IN, and SU. Another system may
represent the same status as 1, 0, and -1.
During the data transformation phase, you need to decode such codes into proper values that are
understandable by the business users. Hence, the above codes can be changed to Active, Inactive,
and Suspended.

#8) Calculated and derived values: By considering the source system data, DW can store additional column
data for the calculations. You have to do the calculations based on the business logic before storing it into DW.
#9) Date/Time conversion: This is one of the key data types to concentrate on. The date/time format may be
different in multiple source systems.
For example, one source may store the date as November 10, 1997. Another source may store the same date
in 11/10/1997 format. Hence, during the data transformation, all the date/time values should be converted into
a standard format.
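A sketch of the date standardization described above: each known source format is tried in turn and every value is converted to one target format. The list of source formats and the ISO-style target format are assumptions for illustration.

```python
from datetime import datetime

# Source formats observed in the example above:
# "November 10, 1997" and "11/10/1997" (assumed set of formats).
SOURCE_FORMATS = ["%B %d, %Y", "%m/%d/%Y"]

def standardize_date(value):
    # Try each known source format; emit one standard target format.
    for fmt in SOURCE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value!r}")

a = standardize_date("November 10, 1997")
b = standardize_date("11/10/1997")
```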
#10) De-duplication: In case the source system has duplicate records, then ensure that only one record is
loaded to the DW system.
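De-duplication can be sketched as keeping only the first occurrence of each business key; the key column and toy rows here are hypothetical.

```python
# De-duplication: a duplicated source record is loaded only once.
def deduplicate(rows, key):
    seen, unique = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique

rows = [{"id": 1, "name": "Marker"}, {"id": 1, "name": "Marker"},
        {"id": 2, "name": "Cap"}]
loaded = deduplicate(rows, "id")
```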
Transformation Flow Diagram:
How To Implement Transformation?
Depending on the complexity of data transformations you can use manual methods, transformation tools (or)
combination of both whichever is effective.

#1) Manual Techniques


Manual techniques are adequate for small DW systems. Data analysts and developers will create the programs
and scripts to transform the data manually. This method needs detailed testing for every portion of the code.

The maintenance cost may become high due to changes in the business rules, or due to the increased
chance of errors as data volumes grow. You should take care of metadata initially, and again with
every change that occurs in the transformation rules.

#2) Transformation Tools


If you want to automate most of the transformation process, then you can adopt the transformation tools
depending on the budget and time frame available for the project. While automating you should spend good
quality time to select the tools, configure, install and integrate them with the DW system.

In practice, complete transformation with tools alone is not possible without manual intervention.
But the data transformed by the tools is certainly efficient and accurate.
To achieve this, we should enter proper parameters, data definitions, and rules to the transformation tool as
input. From the inputs given, the tool itself will record the metadata and this metadata gets added to the overall
DW metadata.

If there are any changes in the business rules, then just enter those changes to the tool, the rest of the
transformation modifications will be taken care of by the tool itself. Hence a combination of both methods is
efficient to use.

Data Loading
Extracted and transformed data gets loaded into the target DW tables during the Load phase of the ETL
process. The business decides how the loading process should happen for each table.

The loading process can happen in the below ways:


 Initial load: Loading the data to populate the respective DW tables for the first time.
 Incremental load: Once the DW tables are loaded, the rest of the ongoing changes are applied
periodically.
 Full refresh: If any tables that are in use need a refresh, then the current data from that table is
completely removed and then reloaded. Reloading is similar to the initial load.

Look at the below example, for a better understanding of the loading process in ETL:

Product ID | Product Name | Sold Date
1          | Grammar book | 3rd June 2007
2          | Marker       | 3rd June 2007
3          | Back bag     | 4th June 2007
4          | Cap          | 4th June 2007
5          | Shoes        | 5th June 2007

#1) During the initial load, the data which is sold on 3rd June 2007 gets loaded into the DW target table because
it is the initial data from the above table.
#2) During the incremental load, we need to load the data sold after 3rd June 2007. We should
consider all records with a sold date greater than (>) the previous load date. Hence, on 4th June
2007, fetch all the records with sold date > 3rd June 2007 by using queries, and load only those
two records from the above table.
On 5th June 2007, fetch all the records with sold date > 4th June 2007 and load only one record from the above
table.
#3) During Full refresh, all the above table data gets loaded into the DW tables at a time irrespective of the sold
date.
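The three loading modes can be sketched against the sold-date example above; the rows mirror the product table, and the function names are illustrative, not a real tool's API.

```python
from datetime import date

# Toy source rows matching the product table above.
source = [
    {"id": 1, "name": "Grammar book", "sold": date(2007, 6, 3)},
    {"id": 2, "name": "Marker",       "sold": date(2007, 6, 3)},
    {"id": 3, "name": "Back bag",     "sold": date(2007, 6, 4)},
    {"id": 4, "name": "Cap",          "sold": date(2007, 6, 4)},
    {"id": 5, "name": "Shoes",        "sold": date(2007, 6, 5)},
]

def initial_load(rows, as_of):
    # First-time population of the DW table.
    return [r for r in rows if r["sold"] == as_of]

def incremental_load(rows, last_load_date):
    # Only rows with sold date strictly greater than the previous load.
    return [r for r in rows if r["sold"] > last_load_date]

def full_refresh(rows):
    # Current target data is removed, then everything is reloaded.
    return list(rows)

dw = initial_load(source, date(2007, 6, 3))                    # ids 1, 2
delta_june4 = incremental_load(source[:4], date(2007, 6, 3))   # ids 3, 4
delta_june5 = incremental_load(source, date(2007, 6, 4))       # id 5
```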
The loaded data is stored in the respective dimension (or) fact tables. The data can be loaded,
appended, or merged to the DW tables as follows:
#4) Load: The data gets loaded into the target table if it is empty. If the table already contains
some data, the existing data is removed and then the new data is loaded.
For example,
Existing Table Data

Employee Name | Role
John          | Manager
Revanth       | Lead
Bob           | Assistant manager
Ronald        | Developer

Changed Data

Employee Name | Role
John          | Manager
Rohan         | Director
Chetan        | AVP
Das           | VP

Data After Loading

Employee Name | Role
John          | Manager
Rohan         | Director
Chetan        | AVP
Das           | VP

#5) Append: Append is an extension of the above load: it works on tables where data already
exists. In the target tables, Append adds more data to the existing data. If a duplicate record is
found in the input data, it may be appended as a duplicate, or it may be rejected.
What is a Star Schema?
A star schema in a data warehouse has one fact table at the center of the star and a number of
associated dimension tables. It is known as a star schema because its structure resembles a star.
The star schema data model is the simplest type of data warehouse schema. It is also known as the
Star Join Schema and is optimized for querying large data sets.

In the following Star Schema example, the fact table is at the center which contains keys to
every dimension table like Dealer_ID, Model ID, Date_ID, Product_ID, Branch_ID & other
attributes like Units sold and revenue.

Example of Star Schema Diagram

Characteristics of Star Schema:

 Every dimension in a star schema is represented by only one dimension table.
 Each dimension table contains a set of attributes.
 The dimension table is joined to the fact table using a foreign key.
 The dimension tables are not joined to each other.
 The fact table contains keys and measures.
 The star schema is easy to understand and provides optimal disk usage.
 The dimension tables are not normalized. For instance, in the above figure,
Country_ID does not have a Country lookup table, as an OLTP design would.
 The schema is widely supported by BI tools.

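The star layout above can be sketched in SQL: one fact table whose foreign key points at a denormalized dimension table, related by a single join. The table and column names are hypothetical, loosely following the dealer/sales example.

```python
import sqlite3

# A tiny star schema: fact_sales at the center, dim_dealer as one
# denormalized dimension (hypothetical columns).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_dealer (dealer_id INTEGER PRIMARY KEY,
                             dealer_name TEXT, country TEXT);
    CREATE TABLE fact_sales (dealer_id INTEGER, units_sold INTEGER,
                             revenue REAL);
    INSERT INTO dim_dealer VALUES (1, 'North Motors', 'USA');
    INSERT INTO fact_sales VALUES (1, 10, 250000.0), (1, 4, 90000.0);
""")

# A single join relates the fact table to the dimension table.
row = conn.execute("""
    SELECT d.dealer_name, SUM(f.units_sold), SUM(f.revenue)
    FROM fact_sales f JOIN dim_dealer d ON f.dealer_id = d.dealer_id
    GROUP BY d.dealer_name
""").fetchone()
```

Note that country lives directly in dim_dealer rather than in its own lookup table, which is the denormalization the characteristics list describes; a snowflake schema would split it out.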
What is a Snowflake Schema?


A snowflake schema in a data warehouse is a logical arrangement of tables in a multidimensional
database such that the ER diagram resembles a snowflake shape. A snowflake schema is an extension
of a star schema in which the dimension tables are normalized, splitting the data into additional
tables.

In the following Snowflake Schema example, Country is further normalized into an individual
table.

Example of Snowflake Schema

Characteristics of Snowflake Schema:

 The main benefit of the snowflake schema is that it uses less disk space.
 It is easier to implement when a new dimension is added to the schema.
 Query performance is reduced because of the multiple tables involved.
 The primary challenge with the snowflake schema is the greater maintenance effort required
because of the larger number of lookup tables.

Star Schema Vs Snowflake Schema: Key Differences


The following are the key differences between the star schema and the snowflake schema:

Star Schema | Snowflake Schema
Hierarchies for the dimensions are stored in the dimension table. | Hierarchies are divided into separate tables.
It contains a fact table surrounded by dimension tables. | One fact table surrounded by dimension tables, which are in turn surrounded by further dimension tables.
Only a single join creates the relationship between the fact table and any dimension table. | Many joins are required to fetch the data.
Simple DB design. | Very complex DB design.
Denormalized data structure; queries also run faster. | Normalized data structure.
High level of data redundancy. | Very low level of data redundancy.
A single dimension table contains aggregated data. | Data is split into different dimension tables.
Cube processing is faster. | Cube processing might be slow because of the complex joins.
Offers higher-performing queries using Star Join Query Optimization; tables may be connected with multiple dimensions. | Represented by a centralized fact table which is unlikely to be connected with multiple dimensions.

What is a Galaxy Schema?


A galaxy schema contains two or more fact tables that share dimension tables between them. It is
also called a Fact Constellation Schema. The schema is viewed as a collection of stars, hence the
name galaxy schema.
Example of Galaxy Schema

As you can see in the above example, there are two fact tables:

1. Revenue
2. Product

In a galaxy schema, the shared dimensions are called conformed dimensions.


Characteristics of Galaxy Schema:
 The dimensions in this schema are separated into separate dimension tables based on the levels
of hierarchy.
 For example, if geography has four levels of hierarchy (region, country, state, and city), then
the galaxy schema should have four dimension tables.
 It is possible to build this type of schema by splitting one star schema into more star schemas.
 The dimension tables in this schema are large, since they are built based on the levels of
hierarchy.
 This schema is helpful for aggregating fact tables for a better understanding.

Horizontal partitioning (sharding)

Figure 1 shows horizontal partitioning or sharding. In this example, product inventory data is divided into
shards based on the product key. Each shard holds the data for a contiguous range of shard keys (A-G
and H-Z), organized alphabetically. Sharding spreads the load over more computers, which reduces
contention and improves performance.
Figure 1 - Horizontal partitioning (sharding) of data based on a partition key.
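The range-based routing described above can be sketched as follows: each product key is sent to the shard that owns its alphabetical range (A-G and H-Z, as in Figure 1). The shard names and keys are hypothetical.

```python
# Range-based sharding: each key's first letter selects a shard.
SHARD_RANGES = {"shard-1": ("A", "G"), "shard-2": ("H", "Z")}

def shard_for(product_key):
    first = product_key[0].upper()
    for shard, (lo, hi) in SHARD_RANGES.items():
        if lo <= first <= hi:
            return shard
    raise KeyError(product_key)

placements = {k: shard_for(k) for k in ["Apple", "Grape", "Hat", "Zip"]}
```

Because each shard owns a contiguous key range, lookups route to exactly one node, which is how sharding spreads load and reduces contention.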

Vertical partitioning

The most common use for vertical partitioning is to reduce the I/O and performance costs associated with
fetching items that are frequently accessed. Figure 2 shows an example of vertical partitioning. In this
example, different properties of an item are stored in different partitions. One partition holds data that is
accessed more frequently, including product name, description, and price. Another partition holds
inventory data: the stock count and last-ordered date.

Figure 2 - Vertically partitioning data by its pattern of use.
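Vertical partitioning, as in Figure 2, can be sketched by splitting one item's properties into a frequently accessed partition and an inventory partition keyed by the same item id. The field names are hypothetical.

```python
# Vertical partitioning: hot fields (name, description, price) go to
# one partition; slow-moving inventory data goes to another.
HOT_FIELDS = {"name", "description", "price"}

def partition_item(item_id, item):
    hot = {k: v for k, v in item.items() if k in HOT_FIELDS}
    cold = {k: v for k, v in item.items() if k not in HOT_FIELDS}
    # Both partitions carry the item id so they can be rejoined.
    return {"id": item_id, **hot}, {"id": item_id, **cold}

hot, cold = partition_item(42, {
    "name": "Cap", "description": "Blue cap", "price": 9.99,
    "stock_count": 120, "last_ordered": "2020-09-23",
})
```

Reads that only need the product listing touch the hot partition, which is the I/O saving the paragraph above describes.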
