
What is a Data Warehouse?

A data warehouse is a centralized database that is used to capture information from different parts of the business process. Integration is closely related to subject orientation (e.g., consolidating answers to a customer feedback question such as "How did you get to know about us?").
A data warehouse is an electronic storage of an organization's historical data for the purpose of reporting, analysis and data mining or knowledge discovery.
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.
Subject-Oriented: To learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?"
Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have
different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse.
Non-volatile: Nonvolatile means that, once entered into the warehouse, data does not change. This is logical because the
purpose of a warehouse is to enable you to analyze what has occurred in the past.

What are the benefits of a data warehouse?

A data warehouse helps to integrate data (see Data integration) and stores it historically, so that we can analyze different aspects of the business, including performance analysis, trends, and predictions, over a given time frame, and use the results of our analysis to improve the efficiency of business processes.
Need: to help management and the business analyze data; this also addresses the need for subject-oriented analysis.

What are Data Mining, a Data Mart, and an Operational Data Store?

Data mining
Data mining is the process of finding patterns in a given data set. These patterns can often provide meaningful and insightful information to whoever is interested in that data. Data mining is a method for comparing large amounts of data for the purpose of finding patterns, and it is normally used for modeling and forecasting.
Data warehousing is the process of compiling and organizing data into one common database, and data mining is the process of extracting meaningful data from that database. Data mining is typically done by business users with the assistance of engineers, whereas data warehousing is typically a process done exclusively by engineers. Data warehousing must occur before any data mining can take place; it can be described as the process of centralizing or aggregating data from multiple sources into one common repository.
E.g., credit card companies keep a history of your past purchases and know geographically where those purchases were made. If all of a sudden some purchases are made in a city far from where you live, the credit card company is put on alert for possible fraud, since its data mining shows that you don't normally make purchases in that city. The company can then decline your card for that transaction or simply flag your card for suspicious activity.
Data Mart
Data marts are generally designed for a single subject area. An organization may have data pertaining to different departments like Finance, HR, Marketing, etc. stored in the data warehouse, and each department may have a separate data mart. These data marts can be built on top of the data warehouse. A data mart often holds only one subject area (for example, Finance or Sales) and is built around a dimensional model using a star schema.

Operational Data Store

The ODS is part of the data warehouse architecture. It is where data is collected and integrated, ensuring the completeness and accuracy of the data, and it holds data that is close to the current operational data feeding the warehouse.
The ODS is a component of the data warehouse and is an "integrated, operational view" of the business. Typically, numerous disparate sources will be staged, connected, integrated and synchronized in the ODS. It is a very handy place for operational users to get a daily snapshot of the data.
This becomes the foundation for providing the end-user community with an integrated view of enterprise data, enabling users anywhere in the organization to access information for strategic and/or tactical decision support, day-to-day operations, and management reporting.
ODS vs. Data Warehouse:
 ODS data is kept in normalized form; data warehouse data is denormalized.
 An ODS contains current and near-current data; a data warehouse contains historical data.
 In an ODS, data can be deleted; in a data warehouse, data is not deleted.
 ODS loads are real-time or near real-time; data warehouse loads are typically batch.
 ODS data is used for tactical (short-term) decision making; data warehouse data is used for strategic (long-term) decision making.
 An ODS is updated at the data field level; data warehouse data is appended, not updated.

Explain Data Warehouse Architecture?

Data Source Layer - sales data, HR data, inventory data, market research data, third-party data such as census data.
Data Extraction Layer - where data is pulled out of the source systems.
Staging Area - temporary storage area where the data sits prior to being transformed into the data warehouse.
ETL Layer - where the logic is applied to transform the data (a sketch follows this list).
Data Storage Layer - where the transformed and cleansed data sits.
Data Logic Layer - where the business rules are stored.
Data Presentation Layer - this refers to the information that reaches the users.
Metadata Layer - this is where information about the data stored in the data warehouse system is kept. A logical data model would be an example of something that lives in the metadata layer.
System Operations Layer - this layer includes information on how the data warehouse system operates, such as ETL job status, system performance, and user access history.
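As a rough illustration of how the staging and ETL layers fit together, here is a sketch in SQL. All table and column names (stg_sales, dw_sales, and so on) are hypothetical, not taken from any particular product:

    -- Staging area: raw, untyped copy of the source extract
    CREATE TABLE stg_sales (
        sale_id      INTEGER,
        sale_date    VARCHAR(20),    -- still a string, exactly as extracted
        product_code VARCHAR(20),
        amount       VARCHAR(20)
    );

    -- Data storage layer: typed, cleansed target table
    CREATE TABLE dw_sales (
        sale_id      INTEGER,
        sale_date    DATE,
        product_code VARCHAR(20),
        sales_amount DECIMAL(10, 2)
    );

    -- ETL layer: transformation logic applied while moving staged rows into storage
    INSERT INTO dw_sales (sale_id, sale_date, product_code, sales_amount)
    SELECT sale_id,
           CAST(sale_date AS DATE),       -- type conversion
           UPPER(TRIM(product_code)),     -- standardization rule
           CAST(amount AS DECIMAL(10, 2))
    FROM stg_sales
    WHERE amount IS NOT NULL;             -- simple data-quality filter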

What is a Data Model and Data Modeling?

A data model in software engineering is an abstract model that describes how data is represented and accessed.
A data model can be thought of as a diagram or flowchart that illustrates the relationships between data elements.
A data model is a graphical view of data created for analysis and design purposes.
Complexity increases from the conceptual to the logical to the physical model. This is why we always start with the conceptual data model (so we understand at a high level what the different entities in our data are and how they relate to one another), then move on to the logical model, and finally the physical model.
The Conceptual data model identifies the scope of the project in terms of the subject areas involved.
The Logical data model lets us understand the details of our data without worrying about how they will actually be implemented. It identifies the details of the subject areas identified in the conceptual model and the relationships between those subject areas. The logical model identifies the entities within each subject area and documents in detail the definition of each entity, its attributes, candidate keys, data types, and its relationships with other entities.
The Physical data model tells us exactly how to implement the data model in the database of choice. It identifies the table structure, the columns within the tables, primary keys, the data types and sizes of the columns, constraints on the tables, and the privileges on the tables.
In a data warehousing project, the conceptual data model and the logical data model are sometimes considered a single deliverable.
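To make the physical level concrete, here is a small sketch of the kind of DDL a physical model pins down; the table and column names are hypothetical:

    CREATE TABLE dim_customer (
        customer_key  INTEGER      NOT NULL,   -- surrogate primary key
        customer_id   VARCHAR(20)  NOT NULL,   -- natural key from the source system
        customer_name VARCHAR(100) NOT NULL,
        city          VARCHAR(50),
        CONSTRAINT pk_dim_customer PRIMARY KEY (customer_key),
        CONSTRAINT uq_dim_customer UNIQUE (customer_id)
    );

The column sizes, keys, and constraints shown here are exactly the decisions that belong to the physical model rather than the logical one.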
Data Modeling -
This activity is performed once we have understood the requirements for the data warehouse/mart. Data modeling refers to creating the structure and relationships of the data for the business rules defined in the requirements.

What is dimensional modeling?

Dimensional modeling is one of the logical design techniques used in data warehousing. DM uses the facts and dimensions of a warehouse for its design. The goal of the dimensional model is not to achieve a high degree of normalization but to facilitate easy and fast data retrieval.
This is the form of data modeling used in data warehousing, where we store business-related attributes in tables called Dimension Tables and the business-related metrics/measures in tables called Fact Tables.

Dimension: A dimension is something that qualifies a quantity (a measure). Dimension tables are used to describe dimensions; they contain dimension keys, values and attributes. A dimension is a category of information that mainly contains descriptive text, for example the time dimension.
For example, consider this: if I just say "20kg", it does not mean anything. But if I say, "20kg of Rice (product) was sold to Ramesh (customer) on 5th April (date)", then that makes meaningful sense. Product, customer and date are dimensions that qualify the measure, 20kg.

Fact: A fact table is a table that contains the measures of interest. A fact is something that is quantifiable (Or measurable).
Facts are typically (but not always) numerical values that can be aggregated. A fact table typically has two types of columns:
those that contain facts and those that are foreign keys to dimension tables.

Suppose our company sells products to customers. Every sale is a fact that happens within our company and the fact table is
used to record these facts.
The fact table lists events that happen in our company (or at least the events that we want to analyze). The dimension tables list
the factors (Customer, Time, Product) by which we want to analyze the data.

In other words, the difference between dimension tables and fact tables is that fact tables hold the data we want to analyze and
the dimension tables hold the information necessary to allow us to query it.
Dimension table and fact table are mainly used in data warehousing. The fact table mainly consists of business facts and
foreign keys that refer to primary keys in the dimension tables.
These measurable facts are used to gauge the business value and to forecast future business.
A dimension table consists mainly of descriptive attributes that are textual fields.
Dimension tables provide descriptive or contextual information for the measurements in a fact table. On the other hand, fact tables provide the measurements of an enterprise.
A dimension table contains a surrogate key, a natural key, and a set of attributes. On the contrary, a fact table contains foreign keys, measurements, and degenerate dimensions.
A fact table contains normalized data and takes more storage because, although it has fewer columns, it has a much larger number of records (it maintains the historical data).
Cumulative: This type of fact table describes what has happened over a period of time. For example, such a fact table may describe the total sales by product by store by day. The facts in this type of fact table are mostly additive facts.
Snapshot: This type of fact table describes the state of things at a particular instant in time, and usually includes more semi-additive and non-additive facts.
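To make the fact/dimension split concrete, here is a sketch of a cumulative-style query over a hypothetical star schema (a fact_sales table plus dim_product, dim_store, and dim_date dimension tables): total sales by product, by store, by day.

    SELECT p.product_name,
           s.store_name,
           d.calendar_date,
           SUM(f.sales_amount) AS total_sales                -- the additive fact
    FROM fact_sales f
    JOIN dim_product p ON f.product_key = p.product_key      -- foreign keys into
    JOIN dim_store   s ON f.store_key   = s.store_key        -- the dimension tables
    JOIN dim_date    d ON f.date_key    = d.date_key
    GROUP BY p.product_name, s.store_name, d.calendar_date;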

Additive Measures
 Measures that can be added across all dimensions. Additive measures can be used with any aggregation function like SUM(), AVG(), etc. An example is Sales Quantity.

Semi-Additive Measures

 Measures that can be added across some dimensions but not others. Semi-additive measures are those where only a subset of aggregation functions can be applied. Take account balance: a SUM() over balances across time does not give a useful result, but MAX() or MIN() balance might be useful. Similarly, consider a price rate or currency rate: SUM is meaningless on a rate; however, the average function might be useful.

Non-additive Measures
 Measures that cannot be added across any dimension. Non-additive measures are those which cannot be used inside any numeric aggregation function (e.g. SUM(), AVG(), etc.). One example of a non-additive fact is any kind of ratio or percentage, e.g. a 5% profit margin or a revenue-to-asset ratio. Non-numerical data can also be a non-additive measure when stored in a fact table, e.g. some kind of varchar flag in the fact table.
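The sketch below contrasts these behaviors, assuming hypothetical fact_account_balance and fact_profitability tables:

    -- Semi-additive: balance adds across accounts but not across time,
    -- so take MAX()/MIN() (or the balance at one point in time) instead
    SELECT account_key,
           MAX(balance) AS peak_balance   -- useful
           -- SUM(balance) over daily snapshots would double-count
    FROM fact_account_balance
    GROUP BY account_key;

    -- Non-additive: a ratio must be recomputed from additive components,
    -- never summed or averaged directly
    SELECT SUM(profit) / SUM(revenue) AS profit_margin
    FROM fact_profitability;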

What is a Factless Fact Table?

A factless fact table is a fact table that does not contain any facts. It contains only dimension keys, and it captures events that happen only at an informational level, not at the calculation level: just information about an event that happens over a period.
A factless fact table captures the many-to-many relationships between dimensions, but contains no numeric or textual facts.
They are often used to record events or coverage information. Common examples of factless fact tables include:
 Identifying product promotion events (to determine promoted products that didn’t sell)
 Tracking student attendance or registration events
 Tracking insurance-related accident events

For example, consider a factless fact table FACT_LEAVE used to capture the leave taken by an employee. Whenever an employee takes leave, a record is created with the dimension keys. Using FACT_LEAVE we can answer many questions, like:
 The number of leaves taken by an employee
 The type of leave an employee takes
 Details of the employee who took leave
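A minimal sketch of such a table, assuming hypothetical employee, leave-type, and date dimensions; note there is no measure column, only dimension keys:

    CREATE TABLE fact_leave (
        employee_key   INTEGER NOT NULL,   -- FK to dim_employee
        leave_type_key INTEGER NOT NULL,   -- FK to dim_leave_type
        date_key       INTEGER NOT NULL    -- FK to dim_date
        -- no measures: the existence of a row is the "fact"
    );

    -- Counting rows answers "how many leaves did each employee take?"
    SELECT employee_key, COUNT(*) AS leaves_taken
    FROM fact_leave
    GROUP BY employee_key;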

What are the Different Types of Schema?

Star Schema, Snowflake Schema and Fact Constellation Schema.


Star schema - A star schema has one fact table associated with a number of dimension tables, depicting a star. This schema is used in data warehouse models where one centralized fact table references a number of dimension tables, so that the keys (primary keys) from all the dimension tables flow into the fact table (as foreign keys), where the measures are stored. The entity-relationship diagram looks like a star, hence the name. The primary key in each dimension table is related to a foreign key in the fact table.
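A minimal sketch of the fact table at the center of such a star, reusing the hypothetical dimension tables from the earlier examples:

    CREATE TABLE fact_sales (
        product_key  INTEGER NOT NULL REFERENCES dim_product (product_key),
        store_key    INTEGER NOT NULL REFERENCES dim_store (store_key),
        date_key     INTEGER NOT NULL REFERENCES dim_date (date_key),
        sales_amount DECIMAL(10, 2),   -- measure
        quantity     INTEGER           -- measure
    );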

Snowflake schema

A snowflake schema is a more normalized form of a star schema. In a star schema, each dimension is represented by a single dimension table, whereas in a snowflake schema the dimension table is normalized into multiple lookup tables, each representing a level in the dimensional hierarchy.
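For instance, a product dimension might be snowflaked so that the category level lives in its own lookup table (hypothetical names again):

    -- In a star schema, dim_product would carry category_name directly.
    -- In a snowflake schema, the category level becomes its own lookup table.
    CREATE TABLE lkp_category (
        category_key  INTEGER PRIMARY KEY,
        category_name VARCHAR(50)
    );

    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name VARCHAR(100),
        category_key INTEGER REFERENCES lkp_category (category_key)
    );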
Fact Constellation Schema
This schema is used mainly for aggregate fact tables, or where we want to split a fact table for better comprehension. The split of a fact table is done only when we want to focus on aggregation over a few facts and dimensions.

Difference between Star Schema and Snowflake Schema?

In a star schema,
1. The fact table is at the center and is connected to the dimension tables.
2. The tables are completely denormalized.
3. SQL query performance is good, as fewer joins are involved.
4. Data redundancy is high and occupies more disk space.

Snowflake Schema:

1. A snowflake schema is an extension of a star schema where the dimension tables are further connected to one or more dimension tables.
2. In a snowflake schema, since there are relationships between the dimension tables, many joins have to be performed to fetch the data.
3. Snowflake schemas normalize dimensions to eliminate redundancy. The result is more complex queries and reduced query performance.
4. The performance of SQL queries is somewhat lower compared to a star schema, as more joins are involved.
5. Data redundancy is low and occupies less disk space compared to a star schema.

What are Slowly Changing Dimensions?

Dimensions that change over time. SCD stands for slowly changing dimension, i.e. a dimension in which the data changes slowly.

Type 1 Slowly Changing Dimension


In a Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other words, no history is kept. A Type 1 dimension is one where history is not maintained and the table always shows the most recent data. This effectively means that such a dimension table is always updated with the recent data whenever there is a change, and because of this update, we lose the previous values.
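A minimal Type 1 sketch against the hypothetical dim_customer table used earlier: the change simply overwrites the old value.

    -- Type 1: overwrite in place; the previous city is lost
    UPDATE dim_customer
    SET city = 'Chicago'
    WHERE customer_id = 'C1001';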

Type 2 Slowly Changing Dimension


In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Therefore, both the
original and the new record will be present. The new record gets its own primary key.

Advantages:
 This allows us to accurately keep all historical information.

Disadvantages:
 This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with,
storage and performance can become a concern.
 This necessarily complicates the ETL process.
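A minimal Type 2 sketch, again using the hypothetical dim_customer table, here assumed to carry a surrogate key plus effective-date and current-flag columns so both versions can coexist:

    -- Step 1: expire the current row...
    UPDATE dim_customer
    SET end_date = CURRENT_DATE, is_current = 'N'
    WHERE customer_id = 'C1001' AND is_current = 'Y';

    -- Step 2: ...and insert a new row with a new surrogate key, preserving history
    INSERT INTO dim_customer
        (customer_key, customer_id, customer_name, city, start_date, end_date, is_current)
    VALUES
        (20457, 'C1001', 'Ramesh', 'Chicago', CURRENT_DATE, NULL, 'Y');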
Type 3 Slowly Changing Dimension
In a Type 3 Slowly Changing Dimension, there are two columns for the particular attribute of interest, one indicating the original value and one indicating the current value. There will also be a column that indicates when the current value became active.

Advantages:
 This does not increase the size of the table, since the existing row is updated in place.
 This allows us to keep some part of the history.
It is used when changes occur a finite number of times.
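A minimal Type 3 sketch on the same hypothetical table, assumed to have paired current/previous columns:

    -- Type 3: shift the current value into the 'previous' column, then overwrite
    UPDATE dim_customer
    SET previous_city    = city,
        city             = 'Chicago',
        city_change_date = CURRENT_DATE
    WHERE customer_id = 'C1001';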

What are the different types of dimensions?

In a data warehouse model, dimensions can be of the following types:

1. Conformed Dimension
Conformed Dimensions are essentially reusable dimensions. A conformed dimension is a dimension that is shared across multiple subject areas. Consider a 'Customer' dimension: both the marketing and sales departments may use the same customer dimension table in their reports. A conformed dimension can also be used to join two fact tables indirectly.

2. Junk Dimension
A junk dimension is a grouping of typically low-cardinality attributes (flags, indicators etc.) so that those can be
removed from other tables and can be junked into an abstract dimension table. The junk dimension is simply a structure
that provides a convenient place to store the junk attributes.
E.g.: Assume that we have a gender dimension and a marital status dimension. In the fact table we would need to maintain two keys referring to these dimensions. Instead, create a junk dimension which has all the combinations of gender and marital status (cross join the gender and marital status tables to create the junk table, as sketched below). Now we maintain only one key in the fact table.
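A sketch of that cross join, assuming hypothetical dim_gender and dim_marital_status tables:

    -- All combinations of the low-cardinality attributes become one junk dimension
    CREATE TABLE dim_demographics_junk AS
    SELECT ROW_NUMBER() OVER (ORDER BY g.gender, m.marital_status) AS junk_key,
           g.gender,
           m.marital_status
    FROM dim_gender g
    CROSS JOIN dim_marital_status m;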
3. Degenerate Dimension
A degenerate dimension is a dimension that is derived from the fact table and does not have its own dimension table.
A dimension key such as a transaction number, receipt number, or invoice number has no further associated attributes and hence cannot be designed as a dimension table.
A degenerate dimension is stored in the fact table rather than in a dimension table, which eliminates the need to join to a dimension table.
 These are essentially dimension keys for which there are no other attributes.
 Degenerate dimensions are used when fact tables represent transactional data.
 They can be used as (part of) the primary key for the fact table, but they cannot act as foreign keys.
 A degenerate dimension has no source table in the OLTP system; it is generated at the time of the transaction, like an invoice number that is generated when the invoice is raised (see the sketch below).
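For instance, a hypothetical invoice-line fact table could carry the invoice number directly, with no dimension table behind it:

    CREATE TABLE fact_invoice_line (
        invoice_number VARCHAR(20),   -- degenerate dimension: nothing to join to
        product_key    INTEGER NOT NULL REFERENCES dim_product (product_key),
        date_key       INTEGER NOT NULL REFERENCES dim_date (date_key),
        sales_amount   DECIMAL(10, 2)
    );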

4. Role Playing Dimension


Dimensions are often reused for multiple applications within the same database with different contextual meanings. For instance, a "Date" dimension can be used for "Date of Sale", as well as "Date of Delivery" or "Date of Hire". This is often referred to as a 'role-playing dimension'.
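In SQL, the roles are typically realized by joining the same dimension table more than once under different aliases; a sketch against a hypothetical fact_order table:

    SELECT o.order_id,
           sale_dt.calendar_date     AS date_of_sale,
           delivery_dt.calendar_date AS date_of_delivery
    FROM fact_order o
    JOIN dim_date sale_dt     ON o.sale_date_key     = sale_dt.date_key
    JOIN dim_date delivery_dt ON o.delivery_date_key = delivery_dt.date_key;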

Data integrity refers to the validity of data, meaning data is consistent and correct.
Comparison of OLTP and OLAP?

OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it.

OLTP (On-line Transaction Processing) is characterized by a large number of short on-line transactions (INSERT, UPDATE and DELETE). The main emphasis for OLTP systems is on very fast query processing, maintaining data integrity in multi-access environments, and effectiveness measured by the number of transactions per second.
An OLTP database is used for day-to-day business operations and is therefore well-normalized.

OLAP (On-line Analytical Processing) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is an effectiveness measure. OLAP applications are widely used by data mining techniques. An OLAP database holds aggregated, historical data, stored in multi-dimensional schemas (usually star schemas).
MOLAP -
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.
 Excellent performance: MOLAP cubes are built for fast data retrieval, and are optimal for slicing and dicing operations.
 Can perform complex calculations: all calculations have been pre-generated when the cube is created.
 Limitation: handling large amounts of data.
ROLAP -
This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement.
 Limitation: performance can be slow.
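A sketch of that idea against the hypothetical star schema used earlier, where each slice or dice of the "cube" becomes another WHERE condition:

    -- Slice the sales 'cube' to one store and one month purely in SQL
    SELECT p.product_name, SUM(f.sales_amount) AS total_sales
    FROM fact_sales f
    JOIN dim_product p ON f.product_key = p.product_key
    JOIN dim_store   s ON f.store_key   = s.store_key
    JOIN dim_date    d ON f.date_key    = d.date_key
    WHERE s.store_name = 'Downtown'          -- slice on the store dimension
      AND d.calendar_month = '2014-04'       -- dice on the date dimension
    GROUP BY p.product_name;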

OLTP vs. OLAP:
 Source of data: OLTP systems hold operational data and are the original source of the data; OLAP holds consolidated data that comes from the various OLTP databases.
 Purpose of data: OLTP is used to control and run fundamental business tasks; OLAP helps with planning, problem solving, and decision support.
 What the data reveals: OLTP shows a snapshot of ongoing business processes; OLAP provides multi-dimensional views of various kinds of business activities.
 Inserts and updates: OLTP has short, fast inserts and updates initiated by end users; OLAP data is refreshed by periodic, long-running batch jobs.
 Queries: OLTP queries are relatively standardized and simple and return relatively few records; OLAP queries are often complex and involve aggregations.
 Processing speed: OLTP is typically very fast; OLAP speed depends on the amount of data involved; batch refreshes and complex queries may take many hours, and query speed can be improved by creating indexes.
