
WEEK 16-17

Dimensional Model, DWH Indexes, Data Integration
What is Dimensional Modeling in Data Warehouse?
Dimensional Modeling
Dimensional Modeling (DM) is a data structure technique optimized for data storage in a data
warehouse. The purpose of dimensional modeling is to optimize the database for faster retrieval
of data. The concept of Dimensional Modelling was developed by Ralph Kimball, and the model
consists of “fact” and “dimension” tables.
A dimensional model in a data warehouse is designed to read, summarize, and analyze numeric
information like values, balances, counts, weights, etc. In contrast, relational models are
optimized for the addition, updating, and deletion of data in a real-time Online Transaction
Processing (OLTP) system.
These dimensional and relational models have their unique way of data storage that has
specific advantages.
For instance, in the relational model, normalization and ER models reduce redundancy in data.
On the contrary, a dimensional model in a data warehouse arranges data in such a way that it is
easier to retrieve information and generate reports.
Hence, dimensional models are used in data warehouse systems and are not a good fit for
relational (OLTP) systems.
Elements of Dimensional Data Model
Fact
Facts are the measurements/metrics from your business process. For a Sales business
process, a measurement would be the quarterly sales number.
Dimension
A dimension provides the context surrounding a business process event. In simple terms, dimensions
give the who, what, and where of a fact. In the Sales business process, for the fact quarterly sales
number, the dimensions would be:

 Who – Customer Names
 Where – Location
 What – Product Name
In other words, a dimension is a window to view information in the facts.
Attributes
The Attributes are the various characteristics of the dimension in dimensional data modeling.
In the Location dimension, the attributes can be

 State
 Country
 Zipcode etc.
Attributes are used to search, filter, or classify facts. Dimension tables contain attributes.

Fact Table
A fact table is the primary table in dimensional modelling.
A fact table contains:
1. Measurements/facts
2. Foreign keys to dimension tables
Dimension Table

 A dimension table contains the dimensions of a fact.
 Dimension tables are joined to the fact table via a foreign key.
 Dimension tables are de-normalized tables.
 The dimension attributes are the various columns in a dimension table.
 Dimensions offer descriptive characteristics of the facts with the help of their attributes.
 There is no set limit on the number of dimensions.
 A dimension can also contain one or more hierarchical relationships.
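
To make the fact table / dimension table split concrete, here is a minimal SQL sketch, assuming a hypothetical retail Sales process; the table and column names (fact_sales, dim_product, and so on) are illustrative only, not prescribed by the model:

```sql
-- Hypothetical dimension table: descriptive attributes of a product
CREATE TABLE dim_product (
    product_key  INT PRIMARY KEY,   -- surrogate key
    product_name VARCHAR(100),
    product_type VARCHAR(50)
);

-- Hypothetical fact table: numeric measurements plus foreign keys to dimensions
CREATE TABLE fact_sales (
    product_key   INT NOT NULL REFERENCES dim_product (product_key),
    location_key  INT NOT NULL,     -- would reference a location dimension
    date_key      INT NOT NULL,     -- would reference a date dimension
    sales_amount  DECIMAL(12, 2),   -- measurement / fact
    quantity_sold INT               -- measurement / fact
);
```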
Types of Dimensions in Data Warehouse
Following are the Types of Dimensions in Data Warehouse:

 Conformed Dimension
 Outrigger Dimension
 Shrunken Dimension
 Role-playing Dimension
 Dimension to Dimension Table
 Junk Dimension
 Degenerate Dimension
 Swappable Dimension
 Step Dimension
Steps of Dimensional Modelling
The accuracy of your dimensional model determines the success of your data
warehouse implementation. Here are the steps to create a dimensional model:

1. Identify Business Process
2. Identify Grain (level of detail)
3. Identify Dimensions
4. Identify Facts
5. Build Schema

The model should describe the Why, How much, When/Where/Who and What of your business
process.

Step 1) Identify the Business Process

Identify the actual business process the data warehouse should cover. This could be Marketing, Sales,
HR, etc., as per the data analysis needs of the organization. The selection of the business process also
depends on the quality of the data available for that process. It is the most important step of the data
modelling process, and a failure here would have cascading and irreparable defects. To describe the
business process, you can use plain text, basic Business Process Modelling Notation (BPMN), or
Unified Modelling Language (UML).

Step 2) Identify the Grain

The grain describes the level of detail for the business problem/solution. It is the process of identifying
the lowest level of information for any table in your data warehouse. If a table contains sales data for
every day, then it has daily granularity. If a table contains total sales data for each month, then it
has monthly granularity. During this stage, you answer questions like:

 Do we need to store all the available products or just a few types of products? This decision is
based on the business processes selected for the data warehouse
 Do we store the product sale information on a monthly, weekly, daily or hourly basis? This
decision depends on the nature of the reports requested by executives
 How do the above two choices affect the database size?

Example of Grain:

The CEO at an MNC wants to find the sales for specific products in different locations on a daily basis.

So, the grain is "product sale information by location by the day."

Step 3) Identify the Dimensions

Dimensions are nouns like date, store, inventory, etc. These dimensions are where all the data should be
stored. For example, the date dimension may contain data like year, month, and weekday.

Example of Dimensions:

The CEO at an MNC wants to find the sales for specific products in different locations on a daily basis.

Dimensions: Product, Location and Time

Attributes: For Product: Product Key (the dimension's primary key, referenced as a foreign key from the fact table), Name, Type, Specifications

Hierarchies: For Location: Country, State, City, Street Address, Name
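
Continuing the hedged sketch from above, the Location and Time dimensions might be laid out as de-normalized tables, with the Location hierarchy flattened into columns (all names are hypothetical):

```sql
-- Location dimension: the hierarchy Country > State > City > Street Address > Name
-- is flattened into the columns of one de-normalized table
CREATE TABLE dim_location (
    location_key   INT PRIMARY KEY,  -- surrogate key
    country        VARCHAR(60),
    state          VARCHAR(60),
    city           VARCHAR(60),
    street_address VARCHAR(120),
    location_name  VARCHAR(100)
);

-- Time dimension at daily grain, matching the grain chosen in Step 2
CREATE TABLE dim_date (
    date_key   INT PRIMARY KEY,      -- e.g. 20240131 used as a surrogate key
    full_date  DATE,
    month_name VARCHAR(20),
    year_num   INT,
    weekday    VARCHAR(10)
);
```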

Step 4) Identify the Fact

This step is carried out together with the business users of the system, because this is where they get access to
data stored in the data warehouse. Most fact table rows hold numeric values like price or cost
per unit, etc.

Example of Facts:

The CEO at an MNC wants to find the sales for specific products in different locations on a daily basis.

The fact here is Sum of Sales by product by location by time.
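
Against the hypothetical tables sketched earlier, this fact at its declared grain could be queried roughly as follows:

```sql
-- Sum of sales by product, by location, by day (the grain from Step 2)
SELECT p.product_name,
       l.city,
       d.full_date,
       SUM(f.sales_amount) AS total_sales
FROM fact_sales AS f
JOIN dim_product  AS p ON p.product_key  = f.product_key
JOIN dim_location AS l ON l.location_key = f.location_key
JOIN dim_date     AS d ON d.date_key     = f.date_key
GROUP BY p.product_name, l.city, d.full_date;
```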

Step 5) Build Schema

In this step, you implement the Dimension Model. A schema is nothing but the database structure
(arrangement of tables). There are two popular schemas
1. Star Schema

The star schema architecture is easy to design. It is called a star schema because the diagram resembles a
star, with points radiating from a center. The center of the star is the fact table, and the points
of the star are the dimension tables.

The fact table in a star schema is in third normal form, whereas the dimension tables are de-
normalized.

2. Snowflake Schema

The snowflake schema is an extension of the star schema. In a snowflake schema, each dimension is
normalized and connected to further dimension tables.
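
As a hedged sketch of what snowflaking the hypothetical location dimension might look like, the geography attributes are pulled out into their own normalized tables instead of being repeated in every location row:

```sql
-- Geography attributes normalized out of the location dimension
CREATE TABLE dim_country (
    country_key INT PRIMARY KEY,
    country     VARCHAR(60)
);

CREATE TABLE dim_city (
    city_key    INT PRIMARY KEY,
    city        VARCHAR(60),
    country_key INT NOT NULL REFERENCES dim_country (country_key)
);

-- The snowflaked location dimension links to dim_city instead of
-- repeating city and country columns in every row
CREATE TABLE dim_location_snowflaked (
    location_key   INT PRIMARY KEY,
    street_address VARCHAR(120),
    location_name  VARCHAR(100),
    city_key       INT NOT NULL REFERENCES dim_city (city_key)
);
```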

Rules for Dimensional Modelling

Following are the rules and principles of Dimensional Modeling:

 Load atomic data into dimensional structures.
 Build dimensional models around business processes.
 Ensure that every fact table has an associated date dimension table.
 Ensure that all facts in a single fact table are at the same grain or level of detail.
 Store report labels and filter domain values in dimension tables.
 Ensure that dimension tables use a surrogate key (see the sketch below).
 Continuously balance requirements and realities to deliver a business solution that supports
decision-making.
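
One way to read the surrogate-key and date-dimension rules, as a hedged sketch: assume the hypothetical dim_product also carries the natural key product_code from the source system, and that incoming rows sit in a hypothetical staging_sales table. The load then resolves natural keys to surrogate keys, so every fact row ends up keyed by surrogates, including a date_key into dim_date:

```sql
-- Resolve natural keys from the source into warehouse surrogate keys
INSERT INTO fact_sales (product_key, location_key, date_key, sales_amount, quantity_sold)
SELECT p.product_key,                 -- surrogate key looked up from product_code
       l.location_key,
       d.date_key,
       s.sale_amount,
       s.quantity
FROM staging_sales AS s
JOIN dim_product  AS p ON p.product_code  = s.product_code   -- natural -> surrogate
JOIN dim_location AS l ON l.location_name = s.store_name
JOIN dim_date     AS d ON d.full_date     = s.sale_date;
```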

Benefits of Dimensional Modeling

 Standardization of dimensions allows easy reporting across areas of the business.
 Dimension tables store the history of the dimensional information.
 It allows an entirely new dimension to be introduced without major disruption to the fact table.
 Dimensional models also store data in such a fashion that it is easier to retrieve the information
once the data is stored in the database.
 Compared to the normalized model, dimensional tables are easier to understand.
 Information is grouped into clear and simple business categories.
 The dimensional model is very understandable by the business. This model is based on business
terms, so the business knows what each fact, dimension, or attribute means.
 Dimensional models are de-normalized and optimized for fast data querying. Many relational
database platforms recognize this model and optimize query execution plans to aid in
performance.
 Dimensional modelling in a data warehouse creates a schema which is optimized for high
performance. It means fewer joins and helps with minimized data redundancy.
 The dimensional model also helps to boost query performance. It is more denormalized and
therefore optimized for querying.
 Dimensional models can comfortably accommodate change. Dimension tables can have more
columns added to them without affecting existing business intelligence applications using these
tables.
What is Multi-Dimensional Data Model in Data Warehouse?

A multidimensional data model in a data warehouse is a model which represents data in the form of data
cubes. It allows data to be modelled and viewed in multiple dimensions, and it is defined by dimensions and
facts. A multidimensional data model is generally organized around a central theme and represented by
a fact table.
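
Many SQL engines can materialize this cube view directly through GROUP BY CUBE (available, for example, in PostgreSQL, Oracle, and SQL Server); here is a sketch against the hypothetical sales schema used earlier:

```sql
-- One result row per combination of product and city, plus subtotal rows
-- (product only, city only) and a grand-total row: the cells of a 2-D cube
SELECT p.product_name,
       l.city,
       SUM(f.sales_amount) AS total_sales
FROM fact_sales AS f
JOIN dim_product  AS p ON p.product_key  = f.product_key
JOIN dim_location AS l ON l.location_key = f.location_key
GROUP BY CUBE (p.product_name, l.city);
```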

Summary:

 A dimensional model is a data structure technique optimized for data warehousing tools.
 Facts are the measurements/metrics from your business process.
 A dimension provides the context surrounding a business process event.
 Attributes are the various characteristics of a dimension.
 A fact table is the primary table in a dimensional model.
 A dimension table contains the dimensions of a fact.
 There are three types of facts: additive, non-additive, and semi-additive.
 Types of dimensions are Conformed, Outrigger, Shrunken, Role-playing, Dimension to
Dimension Table, Junk, Degenerate, Swappable and Step dimensions.
 The five steps of dimensional modelling are: 1. Identify Business Process 2. Identify Grain (level of
detail) 3. Identify Dimensions 4. Identify Facts 5. Build Schema
 For dimensional modelling in a data warehouse, there is a need to ensure that every fact table
has an associated date dimension table.

INDEXING IN DATABASES

Indexing is a way to optimize the performance of a database by minimizing the number of disk accesses
required when a query is processed. It is a data structure technique which is used to quickly locate and
access the data in a database.

Indexes are created using a few database columns.

The first column is the Search key that contains a copy of the primary key or candidate key of the table.
These values are stored in sorted order so that the corresponding data can be accessed quickly.

Note: The data may or may not be stored in sorted order.

The second column is the Data Reference or Pointer which contains a set of pointers holding the address
of the disk block where that particular key value can be found.
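
In SQL, this sorted search-key-plus-pointer structure is usually requested declaratively rather than built by hand; a minimal sketch on a hypothetical customers table:

```sql
-- Ask the database to maintain an index (search key + row pointers)
-- over the last_name column of a hypothetical customers table
CREATE INDEX idx_customers_last_name ON customers (last_name);

-- Queries that filter on the indexed column can now locate the matching
-- rows via the index instead of scanning the whole table
SELECT * FROM customers WHERE last_name = 'Khan';
```
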
The indexing has various attributes:

 Access Types: This refers to the type of access, such as value-based search, range access, etc.
 Access Time: It refers to the time needed to find a particular data element or set of elements.
 Insertion Time: It refers to the time taken to find the appropriate space and insert new data.
 Deletion Time: Time taken to find an item and delete it as well as update the index structure.
 Space Overhead: It refers to the additional space required by the index.

In general, there are two types of file organization mechanism which are followed by the indexing
methods to store the data:

1. Sequential File Organization or Ordered Index File: In this, the indices are based on a sorted
ordering of the values. These are generally fast and a more traditional type of storage
mechanism. Ordered or sequential file organizations might store the data in a dense or
sparse format:

Dense Index:

For every search key value in the data file, there is an index record.

This record contains the search key and also a reference to the first data record with that search key
value.
Sparse Index:

The index record appears only for a few items in the data file. Each index entry points to a block of records.

To locate a record, we find the index record with the largest search key value less than or equal to the
search key value we are looking for.

We start at the record pointed to by the index record and proceed along the pointers in the file
(that is, sequentially) until we find the desired record.

2. Hash File Organization: Indices are based on the values being distributed uniformly across a
range of buckets. The bucket to which a value is assigned is determined by a function called a
hash function.

There are primarily three methods of indexing:


 Clustered Indexing
 Non-Clustered or Secondary Indexing
 Multilevel Indexing

1. Clustered Indexing

When two or more related records are stored in the same file, this type of storage is known as cluster
indexing. By using cluster indexing we can reduce the cost of searching, since multiple
records related to the same thing are stored in one place, and it also benefits frequent joins of two or
more tables (records).

A clustering index is defined on an ordered data file. The data file is ordered on a non-key field. In some
cases, the index is created on non-primary key columns, which may not be unique for each record. In
such cases, in order to identify the records faster, we group two or more columns together to get
unique values and create an index out of them. This method is known as the clustering index. Basically,
records with similar characteristics are grouped together and indexes are created for these groups.

For example, students studying in each semester are grouped together. i.e. 1st Semester students, 2nd
semester students, 3rd semester students etc are grouped.

[Figure: a clustered index sorted according to first name (the search key).]

Primary Indexing:

This is a type of Clustered Indexing wherein the data is sorted according to the search key and the
primary key of the database table is used to create the index. It is a default format of indexing where it
induces sequential file organization. As primary keys are unique and are stored in a sorted manner, the
performance of the searching operation is quite efficient.
2. Non-clustered or Secondary Indexing

A non-clustered index just tells us where the data lies, i.e. it gives us a list of virtual pointers or
references to the locations where the data is actually stored. Data is not physically stored in the order of
the index; instead, the data is present in the leaf nodes. Take, for example, the contents page of a book. Each entry gives us
the page number or location of the information stored. The actual data (the information on each page
of the book) is not reorganized, but we have an ordered reference (the contents page) to where the data
actually lies. We can have only dense ordering in a non-clustered index, as sparse ordering is not
possible because the data is not physically organized accordingly.

It requires more time as compared to the clustered index because some amount of extra work is done in
order to extract the data by further following the pointer. In the case of a clustered index, data is
directly present in front of the index.
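
The exact syntax is product-specific; in SQL Server-style SQL, for instance, the two kinds of index on a hypothetical students table could be declared as follows (a sketch, not a recommendation for any particular schema):

```sql
-- Clustered index: the table's rows are physically ordered by roll_no,
-- so a table can have only one clustered index
CREATE CLUSTERED INDEX ix_students_roll_no
    ON students (roll_no);

-- Non-clustered (secondary) index: a separate ordered structure of
-- (last_name, pointer) entries; the rows themselves stay where they are
CREATE NONCLUSTERED INDEX ix_students_last_name
    ON students (last_name);
```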

3. Multilevel Indexing

With the growth of the size of the database, indices also grow. As the index is stored in main
memory, a single-level index might become too large to store and require multiple disk accesses.
Multilevel indexing segregates the main index block into several smaller blocks so that each can be stored in
a single block. The outer blocks are divided into inner blocks, which in turn point to the data
blocks. This can be easily stored in main memory with less overhead.
DATA INTEGRATION

Data integration refers to the technical and business processes used to combine data from multiple
sources to provide a unified, single view of the data.

What is Data Integration?

Data integration is the practice of consolidating data from disparate sources into a single dataset with
the ultimate goal of providing users with consistent access and delivery of data across the spectrum of
subjects and structure types, and to meet the information needs of all applications and business
processes. The data integration process is one of the main components in the overall data management
process, employed with increasing frequency as big data integration and the need to share existing data
continues to grow.
Data integration architects develop data integration software programs and data integration platforms
that facilitate an automated data integration process for connecting and routing data from source
systems to target systems. This can be achieved through a variety of data integration techniques,
including:

 Extract, Transform and Load: copies of datasets from disparate sources are gathered together,
harmonized, and loaded into a data warehouse or database (sketched after this list)
 Extract, Load and Transform: data is loaded as is into a big data system and transformed at a
later time for particular analytics uses
 Change Data Capture: identifies data changes in databases in real time and applies them to a
data warehouse or other repositories
 Data Replication: data in one database is replicated to other databases to keep the information
synchronized for operational uses and for backup
 Data Virtualization: data from different systems is virtually combined to create a unified view
rather than loading data into a new repository
 Streaming Data Integration: a real-time data integration method in which different streams of
data are continuously integrated and fed into analytics systems and data stores
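
A hedged, hand-coded illustration of the Extract, Transform and Load pattern, assuming two hypothetical staging tables (staging_orders_src_a and staging_orders_src_b) that hold extracted copies with different shapes; the transform step harmonizes them before loading one warehouse table:

```sql
-- Transform and load: harmonize two differently shaped staging extracts
-- into a single warehouse table with consistent formats and units
INSERT INTO dw_orders (order_id, customer_email, order_total_usd, order_date)
SELECT order_id,
       LOWER(customer_email),   -- normalize casing
       order_total,             -- source A already reports amounts in USD
       order_date
FROM staging_orders_src_a
UNION ALL
SELECT order_ref,
       LOWER(email),
       amount_eur * 1.08,       -- hypothetical fixed EUR->USD rate, for illustration only
       CAST(order_ts AS DATE)   -- truncate the timestamp to a date
FROM staging_orders_src_b;
```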

Application Integration vs Data Integration

Data integration technologies were introduced as a response to the adoption of relational databases and
the growing need to efficiently move information between them, typically involving data at rest. In
contrast, application integration manages the integration of live, operational data in real time between
two or more applications.

The ultimate goal of application integration is to enable independently designed applications to operate
together, which requires data consistency among separate copies of data, management of the
integrated flow of multiple tasks executed by disparate applications, and, similar to data integration
requirements, a single user interface or service from which to access data and functionality from
independently designed applications.

A common tool for achieving application integration is cloud data integration, which refers to a system
of tools and technologies that connects various applications for the real-time exchange of data and
processes and provides access by multiple devices over a network or via the internet.

Data Integration Tools and Techniques

Data integration techniques are available across a broad range of organizational levels, from fully
automated to manual methods. Typical tools and techniques for data integration include:

 Manual Integration or Common User Interface: there is no unified view of the data; users
work with all relevant information by accessing each of the source systems directly
 Application Based Integration: requires each application to implement all the integration efforts;
manageable with a small number of applications
 Middleware Data Integration: transfers integration logic from an application to a new
middleware layer
 Uniform Data Access: leaves data in the source systems and defines a set of views to provide a
unified view to users across the enterprise
 Common Data Storage or Physical Data Integration: creates a new system in which a copy of the
data from the source system is stored and managed independently of the original system

Developers may use Structured Query Language (SQL) to code a data integration system by hand. There
are also data integration toolkits available from various IT vendors that streamline, automate, and
document the development process.
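
For instance, a hand-coded sketch of the uniform-data-access idea in SQL: the data stays in the source tables, and a view presents one combined customer list (the source table and column names are hypothetical):

```sql
-- A unified, read-only view over two source systems; no data is copied
CREATE VIEW unified_customers AS
SELECT crm_id     AS customer_id,
       full_name  AS customer_name,
       email      AS customer_email,
       'CRM'      AS source_system
FROM crm_customers
UNION ALL
SELECT cust_no    AS customer_id,
       cust_name  AS customer_name,
       email_addr AS customer_email,
       'BILLING'  AS source_system
FROM billing_customers;
```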

Why is Data Integration Important?

Enterprises that wish to remain competitive and relevant are embracing big data and all its benefits and
challenges. Data integration supports queries in these enormous datasets, benefiting everything from
business intelligence and customer data analytics to data enrichment and real time information delivery.

One of the foremost use cases for data integration services and solutions is the management of business
and customer data. Enterprise data integration feeds integrated data into data warehouses or virtual
data integration architecture to support enterprise reporting, business intelligence (BI data integration),
and advanced analytics.

Customer data integration provides business managers and data analysts with a complete picture of key
performance indicators (KPIs), financial risks, customers, manufacturing and supply chain operations,
regulatory compliance efforts, and other aspects of business processes.

Data integration also plays an important role in the healthcare industry. Integrated data from different
patient records and clinics helps doctors in diagnosing medical conditions and diseases by organizing
data from different systems into a unified view of useful information from which useful insights can be
made. Effective data acquisition and integration also improves claims processing accuracy for medical
insurers and ensures a consistent and accurate record of patient names and contact information. This
exchange of information between different systems is often referred to as interoperability.

What is Big Data Integration?

Big data integration refers to the advanced data integration processes developed to manage the
enormous volume, variety, and velocity of big data, and combines this data from sources such as web
data, social media, machine-generated data, and data from the Internet of Things (IoT), into a single
framework.

Big data analytics platforms require scalability and high performance, emphasizing the need for a
common data integration platform that supports profiling and data quality, and drives insights by
providing the user with the most complete and up-to-date view of their enterprise.

Big data integration services employ real-time integration techniques, which complement traditional ETL
technologies and add dynamic context to continuously streaming data. Best practices for real-time data
integration address its dirty, moving, and temporal nature: more simulation and testing is required
upfront, real-time systems and applications should be adopted, users should implement parallel and
coordinated ingestion engines, establish resiliency in each phase of the pipeline in anticipation of
component failure, and standardize data sources with APIs for better insights.
