Data warehousing concepts

Data: A collection of raw facts in unorganized form that describe an object.
Information: Organized data that has meaning and value.
Knowledge: Processed data or information that conveys understanding or learning applicable to a
problem or activity.
In a data warehouse, data is converted into information so that it can provide the knowledge needed for making
decisions.
What is a Data Warehouse?
A data warehouse is a relational database that is used for reporting and analyzing
data in order to take managerial decisions. It is a central repository created by integrating
data from different sources and converting that data into information for taking managerial
decisions. Data warehouses store current as well as historical data, and the data is read-only; it is used for
management reporting such as annual and quarterly comparisons.
E.g.: Big Bazar --> C.E.O needs annual data --> iPod --> Hyd --> 2011 --> Sales
An organization has different departments, such as the sales, product, and HR departments.
Suppose the C.E.O wants iPod sales in the Hyd location for the year 2011. The C.E.O collects the data from the sales
and product departments and takes decisions on profit and loss based on that information. To
take such decisions, he also needs historical data.

Why is a Data Warehouse implemented?
An enterprise is an integration of different departments that work for the business. Different departments
work on different transactions, and all of these transactions need to be stored in a database called the
Operational Data Store (ODS).
Characteristics of ODS:
1. The ODS is volatile, i.e., its data changes continuously, at unpredictable moments.
2. It does not maintain any historical data.
E.g.: The Big Bazar C.E.O needs to take decisions about a particular product, so he needs the previous 3 to 4 years
of data. But the ODS doesn't maintain any historical data. So we must maintain historical data elsewhere to take
decisions based on product sales. For this we need a DWH.
In DWH --> Data changes at scheduled intervals. E.g.: every day, week, month, or year.
In ODS --> Data changes continuously. E.g.: within seconds, minutes, or hours.
Here, the ODS is the source and the DWH is the target.
Characteristics of DWH
A data warehouse is a historical, subject-oriented, integrated, time-variant and non-volatile collection of
data in support of management's decision-making process.
By Historical we mean that data is continuously collected from the sources and loaded into the warehouse.
Previously loaded data is not deleted for a long period of time. This results in building up historical data
in the warehouse.
By Subject-oriented we mean that data is grouped by a particular business area instead of by the business as a
whole.
By Integrated we mean collecting and merging data from various sources. These sources could be
disparate in nature.
By Time-variant we mean that all data in the data warehouse is identified with a particular time period.
By Non-volatile we mean that data loaded into the warehouse is based on business transactions in the
past; hence it is not expected to change over time.
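The difference between a volatile source and a non-volatile, time-variant warehouse can be sketched in a few lines. This is only an illustration using Python's built-in sqlite3 module; the table names ods_stock and dwh_stock and the column load_date are hypothetical and are not taken from the notes above. The ODS row is overwritten in place, while every warehouse load appends a dated snapshot.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical operational table: holds only the current state.
    CREATE TABLE ods_stock (product TEXT PRIMARY KEY, qty INTEGER);
    -- Hypothetical warehouse table: one row per product per load date.
    CREATE TABLE dwh_stock (product TEXT, qty INTEGER, load_date TEXT);
    INSERT INTO ods_stock VALUES ('ipod', 40);
""")


def load_snapshot(load_date):
    """Append the current ODS state to the warehouse, stamped with load_date."""
    conn.execute(
        "INSERT INTO dwh_stock SELECT product, qty, ? FROM ods_stock",
        (load_date,),
    )


load_snapshot("2011-01-01")
# The ODS is volatile: the old quantity is overwritten and lost there.
conn.execute("UPDATE ods_stock SET qty = 25 WHERE product = 'ipod'")
load_snapshot("2011-01-02")

ods_rows = conn.execute("SELECT product, qty FROM ods_stock").fetchall()
history = conn.execute(
    "SELECT load_date, qty FROM dwh_stock ORDER BY load_date"
).fetchall()
print(ods_rows)   # current state only
print(history)    # one dated row per load
```

After the two loads the ODS holds only the latest quantity, whereas the warehouse keeps both dated values, which is what makes historical comparison possible.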
Overview of DWH Architecture
1. We have different kinds of source data, such as emp.txt, dept.xml, marks.xls, Oracle etc. First,
developers extract the data (creating individual tables) and load it into the staging
database (no transformations are done here).
2. Developers then join these individual tables to get one single table (some
transformations are done here) and load it into the client's DWH.
3. Testers validate the data from the sources to the staging database. We have 3 scenarios

Data warehouse architecture has 3 types:
1. Centralized DWH Architecture
2. Federated DWH Architecture
3. Tiered DWH Architecture

What is a Data Warehouse?
A data warehouse is a relational database that is designed for query and analysis rather
than for transaction processing. It usually contains historical data derived from
transaction data, but it can include data from other sources. It separates analysis
workload from transaction workload and enables an organization to consolidate data
from several sources.
In addition to a relational database, a data warehouse environment includes an
extraction, transportation, transformation, and loading (ETL) solution, an online
analytical processing (OLAP) engine, client analysis tools, and other applications that
manage the process of gathering data and delivering it to business users.
A common way of introducing data warehousing is to refer to the characteristics of a
data warehouse as set forth by William Inmon:
Subject Oriented
Integrated
Nonvolatile
Time Variant
Subject Oriented
Data warehouses are designed to help you analyze data. For example, to learn more
about your company's sales data, you can build a warehouse that concentrates on
sales. Using this warehouse, you can answer questions like "Who was our best
customer for this item last year?" This ability to define a data warehouse by subject
matter, sales in this case, makes the data warehouse subject oriented.
Integrated
Integration is closely related to subject orientation. Data warehouses must put data
from disparate sources into a consistent format. They must resolve such problems as
naming conflicts and inconsistencies among units of measure. When they achieve this,
they are said to be integrated.
Nonvolatile
Nonvolatile means that, once entered into the warehouse, data should not change. This
is logical because the purpose of a warehouse is to enable you to analyze what has
occurred.
Time Variant
In order to discover trends in business, analysts need large amounts of data. This is
very much in contrast to online transaction processing (OLTP) systems, where
performance requirements demand that historical data be moved to an archive. A data
warehouse's focus on change over time is what is meant by the term time variant.
Contrasting OLTP and Data Warehousing Environments
Figure 1-1 illustrates key differences between an OLTP system and a data warehouse.
Figure 1-1 Contrasting OLTP and Data Warehousing Environments

One major difference between the types of system is that data warehouses are not
usually in third normal form (3NF), a type of data normalization common in OLTP
environments.
Difference between Data Warehouse & OLTP
Workload
Data warehouses are designed to accommodate ad hoc queries. You might not
know the workload of your data warehouse in advance, so a data warehouse
should be optimized to perform well for a wide variety of possible query
operations.
OLTP systems support only predefined operations. Your applications might be
specifically tuned or designed to support only these operations.
Data modifications
A data warehouse is updated on a regular basis by the ETL process (run nightly
or weekly) using bulk data modification techniques. The end users of a data
warehouse do not directly update the data warehouse.
In OLTP systems, end users routinely issue individual data modification
statements to the database. The OLTP database is always up to date, and
reflects the current state of each business transaction.
Schema design
Data warehouses often use denormalized or partially denormalized schemas
(such as a star schema) to optimize query performance.
OLTP systems often use fully normalized schemas to optimize
update/insert/delete performance, and to guarantee data consistency.
Typical operations
A typical data warehouse query scans thousands or millions of rows. For
example, "Find the total sales for all customers last month."
A typical OLTP operation accesses only a handful of records. For example,
"Retrieve the current order for this customer."
Historical data
Data warehouses usually store many months or years of data. This is to support
historical analysis.
OLTP systems usually store data from only a few weeks or months. The OLTP
system stores only historical data as needed to successfully meet the
requirements of the current transaction.
Data Warehouse Architectures
Data warehouses and their architectures vary depending upon the specifics of an
organization's situation. Three common architectures are:
Data Warehouse Architecture (Basic)
Data Warehouse Architecture (with a Staging Area)
Data Warehouse Architecture (with a Staging Area and Data Marts)
Data Warehouse Architecture (Basic)
Figure 1-2 shows a simple architecture for a data warehouse. End users directly access
data derived from several source systems through the data warehouse.
Figure 1-2 Architecture of a Data Warehouse

This illustrates three things:
Data Sources (operational systems and flat files)
Warehouse (metadata, summary data, and raw data)
Users (analysis, reporting, and mining)
In Figure 1-2, the metadata and raw data of a traditional OLTP system is present, as is
an additional type of data, summary data. Summaries are very valuable in data
warehouses because they pre-compute long operations in advance. For example, a
typical data warehouse query is to retrieve something like August sales. A summary
in Oracle is called a materialized view.
materialized view: A pre-computed table comprising aggregated or joined data
from fact and possibly dimension tables. Also known as a summary or aggregate
table.
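The idea of pre-computing a summary can be sketched as follows. This is a minimal illustration, not Oracle's materialized view mechanism: it uses SQLite through Python's built-in sqlite3 module and an ordinary table built with CREATE TABLE ... AS SELECT. The names sales_by_month, sale_month and total_sold are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (sale_month TEXT, amount_sold REAL);
    INSERT INTO sales VALUES ('2011-07', 100.0);
    INSERT INTO sales VALUES ('2011-08', 250.0);
    INSERT INTO sales VALUES ('2011-08', 150.0);
    -- Pre-compute the aggregate once, for example at load time.
    CREATE TABLE sales_by_month AS
        SELECT sale_month, SUM(amount_sold) AS total_sold
        FROM sales
        GROUP BY sale_month;
""")

# The "August sales" question now reads one summary row
# instead of scanning every row of the sales table.
august_total = conn.execute(
    "SELECT total_sold FROM sales_by_month WHERE sale_month = '2011-08'"
).fetchone()[0]
print(august_total)
```

In this sketch the summary table is a plain copy, so the load process would have to rebuild it whenever new sales rows arrive.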
Data Warehouse Architecture (with a Staging Area)
In Figure 1-2, you need to clean and process your operational data before putting it
into the warehouse. You can do this programmatically, although most data
warehouses use a staging area instead. A staging area simplifies building summaries
and general warehouse management. Figure 1-3 illustrates this typical architecture.
Figure 1-3 Architecture of a Data Warehouse with a Staging Area

This illustrates four things:
Data Sources (operational systems and flat files)
Staging Area (where data sources go before the warehouse)
Warehouse (metadata, summary data, and raw data)
Users (analysis, reporting, and mining)
Data Warehouse Architecture (with a Staging Area and Data Marts)
Although the architecture in Figure 1-3 is quite common, you may want to customize
your warehouse's architecture for different groups within your organization. You can
do this by adding data marts, which are systems designed for a particular line of
business. Figure 1-4 illustrates an example where purchasing, sales, and inventories
are separated. In this example, a financial analyst might want to analyze historical
data for purchases and sales.
Figure 1-4 Architecture of a Data Warehouse with a Staging Area and Data Marts

This illustrates five things:
Data Sources (operational systems and flat files)
Staging Area (where data sources go before the warehouse)
Warehouse (metadata, summary data, and raw data)
Data Marts (purchasing, sales, and inventory)
Users (analysis, reporting, and mining)
2 Logical Design in Data Warehouses
This chapter tells you how to design a data warehousing environment and includes the
following topics:
Logical Versus Physical Design in Data Warehouses
Creating a Logical Design
Data Warehousing Schemas
Data Warehousing Objects
Logical Versus Physical Design in Data Warehouses
Your organization has decided to build a data warehouse. You have defined the
business requirements and agreed upon the scope of your application, and created a
conceptual design. Now you need to translate your requirements into a system
deliverable. To do so, you create the logical and physical design for the data
warehouse. You then define:
The specific data content
Relationships within and between groups of data
The system environment supporting your data warehouse
The data transformations required
The frequency with which data is refreshed
The logical design is more conceptual and abstract than the physical design. In the
logical design, you look at the logical relationships among the objects. In the physical
design, you look at the most effective way of storing and retrieving the objects as well
as handling them from a transportation and backup/recovery perspective.
Orient your design toward the needs of the end users. End users typically want to
perform analysis and look at aggregated data, rather than at individual transactions.
However, end users might not know what they need until they see it. In addition, a
well-planned design allows for growth and changes as the needs of users change and
evolve.
By beginning with the logical design, you focus on the information requirements and
save the implementation details for later.
Creating a Logical Design
A logical design is conceptual and abstract. You do not deal with the physical
implementation details yet. You deal only with defining the types of information that
you need.
One technique you can use to model your organization's logical information
requirements is entity-relationship modeling. Entity-relationship modeling involves
identifying the things of importance (entities), the properties of these things
(attributes), and how they are related to one another (relationships).
The process of logical design involves arranging data into a series of logical
relationships called entities and attributes. An entity represents a chunk of
information. In relational databases, an entity often maps to a table. An attribute is a
component of an entity that helps define the uniqueness of the entity. In relational
databases, an attribute maps to a column.
To be sure that your data is consistent, you need to use unique identifiers. A unique
identifier is something you add to tables so that you can differentiate between the
same item when it appears in different places. In a physical design, this is usually a
primary key.
While entity-relationship diagramming has traditionally been associated with highly
normalized models such as OLTP applications, the technique is still useful for data
warehouse design in the form of dimensional modeling. In dimensional modeling,
instead of seeking to discover atomic units of information (such as entities and
attributes) and all of the relationships between them, you identify which information
belongs to a central fact table and which information belongs to its associated
dimension tables. You identify business subjects or fields of data, define relationships
between business subjects, and name the attributes for each subject.
Your logical design should result in (1) a set of entities and attributes corresponding to
fact tables and dimension tables and (2) a model of operational data from your source
into subject-oriented information in your target data warehouse schema.
You can create the logical design using a pen and paper, or you can use a design tool
such as Oracle Warehouse Builder (specifically designed to support modeling the
ETL process) or Oracle Designer (a general purpose modeling tool).
Data Warehousing Schemas
A schema is a collection of database objects, including tables, views, indexes, and
synonyms. You can arrange schema objects in the schema models designed for data
warehousing in a variety of ways. Most data warehouses use a dimensional model.
The model of your source data and the requirements of your users help you design the
data warehouse schema. You can sometimes get the source model from your
company's enterprise data model and reverse-engineer the logical data model for the
data warehouse from this. The physical implementation of the logical data warehouse
model may require some changes to adapt it to your system parameters--size of
machine, number of users, storage capacity, type of network, and software.
Star Schemas
The star schema is the simplest data warehouse schema. It is called a star schema
because the diagram resembles a star, with points radiating from a center. The center
of the star consists of one or more fact tables and the points of the star are the
dimension tables, as shown in Figure 2-1.
Figure 2-1 Star Schema

This illustrates a typical star schema. In it, the dimension tables are:
times
channels
products
customers



The fact table is sales. sales shows columns amount_sold and quantity_sold.
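A reduced version of this star schema can be sketched as follows, using SQLite through Python's built-in sqlite3 module purely for illustration. Only two of the four dimensions are modelled, and the key and description columns (prod_id, prod_name, channel_id, channel_desc) are assumptions of this sketch; only the table names and the measures amount_sold and quantity_sold come from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two of the dimension tables; column names are assumed for the sketch.
    CREATE TABLE products (prod_id INTEGER PRIMARY KEY, prod_name TEXT);
    CREATE TABLE channels (channel_id INTEGER PRIMARY KEY, channel_desc TEXT);
    -- Fact table: foreign keys to the dimensions plus the two measures.
    CREATE TABLE sales (
        prod_id INTEGER REFERENCES products (prod_id),
        channel_id INTEGER REFERENCES channels (channel_id),
        amount_sold REAL,
        quantity_sold INTEGER
    );
    INSERT INTO products VALUES (1, 'ipod'), (2, 'phone');
    INSERT INTO channels VALUES (1, 'store'), (2, 'web');
    INSERT INTO sales VALUES (1, 1, 200.0, 2), (1, 2, 100.0, 1), (2, 2, 300.0, 1);
""")

# A typical star query: join the fact table to its dimensions,
# group by dimension attributes and aggregate the measures.
totals = conn.execute("""
    SELECT p.prod_name, c.channel_desc, SUM(s.amount_sold), SUM(s.quantity_sold)
    FROM sales s
    JOIN products p ON p.prod_id = s.prod_id
    JOIN channels c ON c.channel_id = s.channel_id
    GROUP BY p.prod_name, c.channel_desc
    ORDER BY p.prod_name, c.channel_desc
""").fetchall()
for row in totals:
    print(row)
```

Every join goes from the central fact table outwards to one dimension, which is what gives the diagram its star shape.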



SQL set operators
1 Introduction
SQL set operators combine the results of two or more SELECT statements. At first
sight this looks similar to SQL joins, although there is a big difference. SQL joins tend to
combine columns, i.e. with each additionally joined table it is possible to select more and
more columns. SQL set operators, on the other hand, combine rows from different queries
under strong preconditions. All involved SELECT statements must:
retrieve the same number of columns, and
return compatible data types in corresponding columns (either the same type or one that
can be implicitly converted to the data type used in
the first SELECT statement).
Visually the difference can be explained as follows: joins tend to extend the result breadthways, whereas
set operations extend it in depth.
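The breadth versus depth distinction can be sketched with two tiny tables. The sketch runs on SQLite through Python's built-in sqlite3 module only so that it needs no database server; the tables t1 and t2 and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER, city TEXT);
    CREATE TABLE t2 (id INTEGER, country TEXT);
    INSERT INTO t1 VALUES (1, 'RIGA'), (2, 'TALLINN');
    INSERT INTO t2 VALUES (1, 'LATVIA'), (2, 'ESTONIA');
""")

# A join widens the result: columns from both tables appear side by side.
joined = conn.execute(
    "SELECT t1.id, city, country FROM t1 JOIN t2 ON t2.id = t1.id ORDER BY t1.id"
).fetchall()

# A set operator lengthens it: same column count, rows stacked.
stacked = conn.execute(
    "SELECT id, city FROM t1 UNION ALL SELECT id, country FROM t2"
).fetchall()

print(len(joined), "rows of", len(joined[0]), "columns")    # join: breadthways
print(len(stacked), "rows of", len(stacked[0]), "columns")  # set operator: in depth
```

The join keeps two rows and grows to three columns, while UNION ALL keeps two columns and grows to four rows.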


NB 1! All examples are created for Oracle database and written according to Oracle syntax.
However, many of them can be used, with (probably) very few modifications or even
unchanged, on any other
DBMS that supports set operators. Exactly why they work or do not work is described
for Oracle, SQL Server and MySQL. If you need to use them on other DBMSes, you
should check the examples yourself, although I would be very pleased if you sent me
information about which examples do not work on which DBMSes. I will include this info here
along with your name.
Contents
1 Introduction
2 Set operator types and syntax
2.1 Common facts to remember
2.2 Used tables for examples
2.3 UNION [DISTINCT] and UNION ALL
2.4 EXCEPT [DISTINCT] and EXCEPT ALL
2.5 INTERSECT [DISTINCT] and INTERSECT ALL
2.6 Raising it to higher levels - are two table data equal?
3 Example usage for various DBMSes
3.1 Oracle
3.2 Microsoft SQL Server
3.3 MySQL
3.4 IBM DB2
4 References and more information

2 Set operator types and syntax
According to the SQL standard there are the following set operator types:
UNION [DISTINCT];
UNION ALL;
EXCEPT [DISTINCT];
EXCEPT ALL;
INTERSECT [DISTINCT];
INTERSECT ALL.
As we can see, there are 3 basic types (Union, Except and Intersect), and each has 2
modifications, Distinct or All. The SQL standard does not require the keyword Distinct, and
some DBMSes, for example Oracle and SQL Server, do not even allow it. Therefore, if you see
just Union, Except or Intersect, these actually mean Union Distinct, Except Distinct and
Intersect Distinct.
It is clear from the syntax itself that the Distinct modification removes duplicates from
the result set, whereas the All modification retains them.
The query syntax is common to all of them:
<query1>
<SET OPERATOR>
<query2>
Each of query1 and query2 is a full-fledged SELECT statement with possible joins, subqueries
and other constructions. It is also possible to combine more than 2 SELECT
statements with set operators between them.
Let's look at a general overview of what each of them returns. The following chart defines
2 queries returning data and shows the results of the various set operator combinations between them.
QueryA: Riga, Riga, Riga, Tallinn, Tallinn, Tallinn, Vilnius, Helsinki, Helsinki, Stockholm
QueryB: Riga, Riga, Vilnius, Vilnius, Vilnius, Vilnius, Helsinki
QueryA UNION [DISTINCT] QueryB (Example 1): Riga, Tallinn, Vilnius, Helsinki, Stockholm
QueryA UNION ALL QueryB (Example 3): Riga, Riga, Riga, Tallinn, Tallinn, Tallinn, Vilnius, Helsinki, Helsinki, Stockholm, Riga, Riga, Vilnius, Vilnius, Vilnius, Vilnius, Helsinki
QueryA EXCEPT (MINUS) [DISTINCT] QueryB (Example 10): Tallinn, Stockholm
QueryB EXCEPT (MINUS) [DISTINCT] QueryA (Example 11): no rows
QueryA EXCEPT (MINUS) ALL QueryB (Example 14): Riga, Tallinn, Tallinn, Tallinn, Helsinki, Stockholm
QueryB EXCEPT (MINUS) ALL QueryA (Example 15): Vilnius, Vilnius, Vilnius
QueryA INTERSECT [DISTINCT] QueryB (Example 17): Riga, Vilnius, Helsinki
QueryA INTERSECT ALL QueryB (Example 19): Riga, Riga, Vilnius, Helsinki

The next chart uses the same QueryA, but QueryC returns 0 rows, to show some possible
quirks.
QueryA: Riga, Riga, Riga, Tallinn, Tallinn, Tallinn, Vilnius, Helsinki, Helsinki, Stockholm
QueryC: no rows
QueryA UNION [DISTINCT] QueryC (Example 4): Riga, Tallinn, Vilnius, Helsinki, Stockholm
QueryA UNION ALL QueryC (Example 5): Riga, Riga, Riga, Tallinn, Tallinn, Tallinn, Vilnius, Helsinki, Helsinki, Stockholm
QueryA EXCEPT (MINUS) [DISTINCT] QueryC (Example 12): Riga, Tallinn, Vilnius, Helsinki, Stockholm
QueryA EXCEPT (MINUS) ALL QueryC (Example 16): Riga, Riga, Riga, Tallinn, Tallinn, Tallinn, Vilnius, Helsinki, Helsinki, Stockholm
QueryC EXCEPT (MINUS) [DISTINCT] QueryA: no rows
QueryC EXCEPT (MINUS) ALL QueryA: no rows
QueryA INTERSECT [DISTINCT] QueryC (Example 18): no rows
QueryA INTERSECT ALL QueryC: no rows
2.1 Common facts to remember
There are some facts that probably aren't obvious and should be mentioned. Let's expand
the requirements for queries that are combined using one of the set operators:
the column count must be the same;
the data types of retrieved columns should match or at least be implicitly
convertible by the database;
one can use many set operators, for example Query1 UNION ALL Query2 UNION ALL
Query3 MINUS Query4 INTERSECT Query5. In such a case one should check the
documentation of the DBMS in use for the order of evaluation, because, for example, Oracle
executes the operators from left to right, but DB2 executes Intersect first;
usually the returned column names are taken from the first query;
Order by clauses in each individual query except the last one are either not allowed at all
(Oracle) or are ignored (MySQL).
Some other facts:
UNION and INTERSECT operators are commutative, i.e. the order of the queries is not
important; it doesn't change the final result. See Example 1 and Example 2.
The EXCEPT operator is NOT commutative; it IS important which query comes first and which
second. See Example 10 and Example 11.
UNION, EXCEPT and INTERSECT used without a modifier or with DISTINCT return
only unique values. This is especially interesting when one query returning many
non-unique rows is UNIONed with another query returning zero rows (Example 4). The
final result contains fewer rows than the first query.
If you know that the combined rows of all queries are unique (no duplicates within or across the queries), then use UNION ALL,
because the database doesn't know that and uses more (wasted) resources to filter out
duplicates in the case of UNION.
If you need a deterministic ordering, then use an Order by clause in the last query. Don't
assume that rows from the first query will always be returned first.
If you need to distinguish which query produced a row, you can add a tag or
flag column indicating the source query.
NULL values are considered equal to each other by set operators (Example
9).
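Two of the facts above, the tag column and the column names coming from the first query, can be sketched together. The sketch runs on SQLite through Python's built-in sqlite3 module, which is an assumption made only so that it can be executed anywhere; the tag column name src and the reduced sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Reduced sample data, not the full tables used in the examples.
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, city TEXT NOT NULL);
    CREATE TABLE table2 (id INTEGER PRIMARY KEY, city TEXT NOT NULL);
    INSERT INTO table1 VALUES (1, 'RIGA'), (2, 'TALLINN');
    INSERT INTO table2 VALUES (1, 'RIGA'), (2, 'VILNIUS');
""")

cursor = conn.execute("""
    SELECT 'table1' AS src, city FROM table1
    UNION ALL
    SELECT 'table2', city FROM table2
""")
# Column names of the combined result come from the first query.
column_names = [col[0] for col in cursor.description]
tagged = cursor.fetchall()

print(column_names)
for src, city in sorted(tagged):
    print(src, city)
```

With the tag in place, the row RIGA stays distinguishable by source even though both tables contain it.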
2.2 Used tables for examples
Throughout this article we will use the following tables and table data (the same data as
used in the charts above):
CREATE TABLE table1 (
id INTEGER NOT NULL PRIMARY KEY,
city VARCHAR(10) NOT NULL);
CREATE TABLE table2 (
id INTEGER NOT NULL PRIMARY KEY,
city VARCHAR(10) NOT NULL);
CREATE TABLE table3 (
city VARCHAR(10) NOT NULL);
INSERT INTO table1 VALUES (1, 'RIGA');
INSERT INTO table1 VALUES (2, 'RIGA');
INSERT INTO table1 VALUES (3, 'RIGA');
INSERT INTO table1 VALUES (4, 'TALLINN');
INSERT INTO table1 VALUES (5, 'TALLINN');
INSERT INTO table1 VALUES (6, 'TALLINN');
INSERT INTO table1 VALUES (7, 'VILNIUS');
INSERT INTO table1 VALUES (8, 'HELSINKI');
INSERT INTO table1 VALUES (9, 'HELSINKI');
INSERT INTO table1 VALUES (10, 'STOCKHOLM');
INSERT INTO table2 VALUES (1, 'RIGA');
INSERT INTO table2 VALUES (2, 'RIGA');
INSERT INTO table2 VALUES (3, 'VILNIUS');
INSERT INTO table2 VALUES (4, 'VILNIUS');
INSERT INTO table2 VALUES (5, 'VILNIUS');
INSERT INTO table2 VALUES (6, 'VILNIUS');
INSERT INTO table2 VALUES (7, 'HELSINKI');
COMMIT;
2.3 UNION [DISTINCT] and UNION ALL
These are usually the most widely used set operators. Quite often one cannot get the whole
result from one Select statement. Then one of the UNIONs can help.
Graphically UNION can be visualised using Venn diagrams. Assume we have two row sets.

Then Query1 UNION Query2 would be as follows. Grey area shows resultant set.

Of course the previous picture is a very general visualisation and is fully accurate only for sets that
contain each element no more than once.
Query1 UNION ALL Query2 would be as follows:


2.3.1 Examples
As we can see, only unique rows are returned in the next example.
Example 1 Unions cities from table1 and table2.
SELECT city FROM table1
UNION
SELECT city FROM table2;

CITY
----------
HELSINKI
RIGA
STOCKHOLM
TALLINN
VILNIUS
Example 2 Unions cities from table2 and table1. The query ordering is not
important, result is the same, compare with Example 1.
SELECT city FROM table2
UNION
SELECT city FROM table1;

CITY
----------
HELSINKI
RIGA
STOCKHOLM
TALLINN
VILNIUS
DO NOT ASSUME that Union always returns an ordered row set. It is NOT TRUE. The rows come back sorted only
because of the implementation model, i.e. a sort is done to filter out duplicates. At least
since version 10, Oracle can use a HASH UNIQUE operation instead, which doesn't sort
rows, so you won't get them back sorted. So ALWAYS use an Order by clause if you need
a guaranteed order of rows.
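A minimal sketch of an explicit ordering on a combined result follows. It runs on SQLite through Python's built-in sqlite3 module instead of Oracle, purely so that it is self-contained, and uses reduced sample data. The Order by clause is written once, after the last query, and applies to the whole result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (city TEXT NOT NULL);
    CREATE TABLE table2 (city TEXT NOT NULL);
    INSERT INTO table1 VALUES ('TALLINN'), ('RIGA'), ('RIGA');
    INSERT INTO table2 VALUES ('VILNIUS'), ('HELSINKI');
""")

# ORDER BY after the last query sorts the combined, de-duplicated rows.
ordered = conn.execute("""
    SELECT city FROM table1
    UNION
    SELECT city FROM table2
    ORDER BY city DESC
""").fetchall()
print([city for (city,) in ordered])
```

Descending order is used here deliberately, so that the result cannot be mistaken for a side effect of duplicate elimination.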
Next example just combines the rows without filtering out duplicates.

Example 3 Unions ALL cities from table1 and table2.
SELECT city FROM table1
UNION ALL
SELECT city FROM table2;

CITY
----------
RIGA
RIGA
RIGA
TALLINN
TALLINN
TALLINN
VILNIUS
HELSINKI
HELSINKI
STOCKHOLM
RIGA
RIGA
VILNIUS
VILNIUS
VILNIUS
VILNIUS
HELSINKI

17 rows selected.
Example 4 UNION [DISTINCT] even with empty set may reduce number of
rows. Compare result from first two queries with third query.
SELECT city FROM table1;

CITY
----------
RIGA
RIGA
RIGA
TALLINN
TALLINN
TALLINN
VILNIUS
HELSINKI
HELSINKI
STOCKHOLM

10 rows selected.

SELECT city FROM table3;

no rows selected

SELECT city FROM table1
UNION
SELECT city FROM table3;

CITY
----------
HELSINKI
RIGA
STOCKHOLM
TALLINN
VILNIUS
Example 5 UNION ALL with empty set gives the same result as without it.
SELECT city FROM table1
UNION ALL
SELECT city FROM table3;

CITY
----------
RIGA
RIGA
RIGA
TALLINN
TALLINN
TALLINN
VILNIUS
HELSINKI
HELSINKI
STOCKHOLM

10 rows selected.
Example 6 Each query in Union must return the same number of columns.
SELECT * FROM table1
UNION
SELECT city FROM table2;
SELECT * FROM table1
*
ERROR at line 1:
ORA-01789: query block has incorrect number of result columns
Example 7 Of course a query can be Unioned with itself. This time all rows are
returned because the combination of both columns is compared.
SELECT * FROM table1
UNION
SELECT * FROM table1;

ID CITY
---------- ----------
1 RIGA
2 RIGA
3 RIGA
4 TALLINN
5 TALLINN
6 TALLINN
7 VILNIUS
8 HELSINKI
9 HELSINKI
10 STOCKHOLM

10 rows selected.
Along with subquery factoring clause (or common table expression clause, "with" clause)
UNION ALL can be used to generate some sample data without having actual tables. It has
become very popular in Oracle forums.

Example 8 Using "with" clause to generate sample test data to test inner
join functionality.
WITH cities AS (
  SELECT 1 AS cty_id, 'RIGA' AS city FROM dual
  UNION ALL
  SELECT 2, 'TALLINN' FROM dual
),
streets AS (
  SELECT 1 AS str_id, 1 AS str_cty_id, 'BRIVIBAS' AS street FROM dual
  UNION ALL
  SELECT 2, 2, 'NARVA MNT' FROM dual
)
SELECT city, street FROM cities
INNER JOIN streets ON (str_cty_id = cty_id);

CITY STREET
------- ---------
RIGA BRIVIBAS
TALLINN NARVA MNT
NULL values are considered equal when used with set operators. This differs from the usual
behaviour, for example, when testing for equality.
Example 9 Using "with" clause to generate two NULL values and unioning
them.
WITH null1 AS (
SELECT NULL value FROM dual
),
null2 AS (
SELECT NULL value FROM dual
)
SELECT value FROM null1
UNION
SELECT value FROM null2;

V
-


1 row selected.
2.4 EXCEPT [DISTINCT] and EXCEPT ALL
EXCEPT returns unique rows that are returned by the first query but are NOT returned by
the second query. EXCEPT ALL does the same but retains cardinality. For example, if the
first query returns two values of X and the second only one, then EXCEPT won't return X but
EXCEPT ALL will return one instance of X.
Oracle uses the MINUS operator instead of EXCEPT, but the functionality is the same. At the time of writing, none of
Oracle, SQL Server and MySQL has implemented EXCEPT ALL. It can be simulated using
analytic functions, as shown in Example 14 to Example 16.
Usually EXCEPT is used to compare data in different data sources (tables) to find
differences, for example, differences between the same tables in test and production and/or
between an actual copy and a backup.
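The X example above can be checked with a small model of the two semantics. This sketch does not use a database at all: it models query results as Python lists and uses set and collections.Counter from the standard library. The extra value Y is an addition made only for illustration.

```python
from collections import Counter

query1 = ["X", "X", "Y"]   # the first query returns X twice
query2 = ["X"]             # the second query returns X once

# EXCEPT [DISTINCT]: unique values of query1 that never occur in query2.
except_distinct = sorted(set(query1) - set(query2))

# EXCEPT ALL: subtract occurrence counts; values whose count drops to zero disappear.
except_all = sorted((Counter(query1) - Counter(query2)).elements())

print(except_distinct)
print(except_all)
```

EXCEPT drops X entirely because X occurs in the second query, while EXCEPT ALL keeps the one instance of X that is left after subtracting the counts.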
Visually Query1 EXCEPT Query2 can be expressed as follows:

Obviously diagram is not symmetric therefore for Query2 EXCEPT Query1 we get different
picture:


2.4.1 Examples
Example 10 Cities in table1 except (minus) [distinct] cities in table2.
SELECT city FROM table1
MINUS
SELECT city FROM table2;

CITY
----------
STOCKHOLM
TALLINN
Example 11 Cities in table2 except (minus) [distinct] cities in table1. Of
course the result is different than in Example 10.
SELECT city FROM table2
MINUS
SELECT city FROM table1;

no rows selected
Because MINUS filters out duplicates, even subtracting an empty set may reduce the initial set.
Example 12 Cities in table1 except (minus) [distinct] empty set (cities in
table3).
SELECT city FROM table1
MINUS
SELECT city FROM table3;

CITY
----------
HELSINKI
RIGA
STOCKHOLM
TALLINN
VILNIUS
It is not possible in Oracle and SQL Server to use EXCEPT (MINUS) ALL directly.
Example 13 Minus all doesn't exist in Oracle.
SELECT city FROM table1
MINUS ALL
SELECT city FROM table2;

MINUS ALL
*
ERROR at line 2:
ORA-00928: missing SELECT keyword
However, it is possible using analytic functions and a simple minus. The main idea of MINUS
(EXCEPT) ALL is to retain cardinality, i.e. how many instances of each row exist in the source
sets. Here the analytic function row_number() can help. It increments a counter for each row
that has the same values as the previous one and restarts when the row values change. Then we can use
a simple MINUS [DISTINCT] and show only the business columns.
Example 14 Faked minus all using row_number() analytic function. Cities
in table1 except (minus) all cities in table2.
SELECT city FROM (
  SELECT city,
         row_number() OVER (PARTITION BY city ORDER BY city) rn
  FROM table1
  MINUS
  SELECT city,
         row_number() OVER (PARTITION BY city ORDER BY city) rn
  FROM table2
) q;

CITY
----------
HELSINKI
RIGA
STOCKHOLM
TALLINN
TALLINN
TALLINN
Example 15 Faked minus all using row_number() analytic function. Cities
in table2 except (minus) all cities in table1.
SELECT city FROM (
  SELECT city,
         row_number() OVER (PARTITION BY city ORDER BY city) rn
  FROM table2
  MINUS
  SELECT city,
         row_number() OVER (PARTITION BY city ORDER BY city) rn
  FROM table1
) q;

CITY
----------
VILNIUS
VILNIUS
VILNIUS
Subtracting an empty set with Minus all doesn't change the set, just like with Union all.
Example 16 Faked minus all using row_number() analytic function. Cities
in table1 except (minus) all empty set (cities in table3).
SELECT city FROM (
  SELECT city,
         row_number() OVER (PARTITION BY city ORDER BY city) rn
  FROM table1
  MINUS
  SELECT city,
         row_number() OVER (PARTITION BY city ORDER BY city) rn
  FROM table3
) q;

CITY
----------
HELSINKI
HELSINKI
RIGA
RIGA
RIGA
STOCKHOLM
TALLINN
TALLINN
TALLINN
VILNIUS
It is obvious that subtracting anything from an empty set always gives an empty set, therefore I
won't show you these examples. This is true for both modifications of except (minus): distinct
and all.
2.5 INTERSECT [DISTINCT] and INTERSECT ALL
Intersect returns only those rows that are returned by both queries. Intersect [distinct] returns just
unique rows, but intersect all retains cardinality. Intersect is commutative, just like union:
it is not important which query is first and which is second.
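The difference between the two Intersect modifications, and the commutativity, can be checked with a small model. This sketch uses no database: the city lists repeat the contents of table1 and table2 from section 2.2, and set and collections.Counter from the Python standard library stand in for the distinct and all semantics.

```python
from collections import Counter

table1 = (["RIGA"] * 3 + ["TALLINN"] * 3 + ["VILNIUS"]
          + ["HELSINKI"] * 2 + ["STOCKHOLM"])
table2 = ["RIGA"] * 2 + ["VILNIUS"] * 4 + ["HELSINKI"]

# INTERSECT [DISTINCT]: each common value once.
intersect_distinct = sorted(set(table1) & set(table2))

# INTERSECT ALL: each common value as many times as it occurs in BOTH inputs,
# i.e. the minimum of the two occurrence counts.
intersect_all = sorted((Counter(table1) & Counter(table2)).elements())

# Swapping the operands gives the same rows.
intersect_all_swapped = sorted((Counter(table2) & Counter(table1)).elements())

print(intersect_distinct)
print(intersect_all)
```

RIGA occurs three times in table1 and twice in table2, so intersect all keeps it twice, while intersect [distinct] keeps it once.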
Picture for Query1 INTERSECT Query2 is as follows:


2.5.1 Examples
Example 17 Cities in table1 intersect [distinct] cities in table2.
SELECT city FROM table1
INTERSECT
SELECT city FROM table2;

CITY
----------
HELSINKI
RIGA
VILNIUS
Example 18 Cities in table1 intersect [distinct] empty set (cities in table3).
Every intersection with an empty set is an empty set.
SELECT city FROM table1
INTERSECT
SELECT city FROM table3;

no rows selected
Intersect all is not available in the Oracle and SQL Server versions tested here (see section 3),
just as with minus (except) all. But we can use the already known workaround.
Example 19 Faked intersect all using row_number() analytic function.
Cities in table1 intersect all cities in table2.
SELECT city FROM (
  SELECT city,
         row_number() OVER (PARTITION BY city ORDER BY city) rn
  FROM table1
  INTERSECT
  SELECT city,
         row_number() OVER (PARTITION BY city ORDER BY city) rn
  FROM table2
) q;

CITY
----------
HELSINKI
RIGA
RIGA
VILNIUS
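As a cross-check of the faked intersect all, here is a small sketch under the same kind of assumptions as the article's examples: Python's built-in sqlite3 module (SQLite 3.25 or later for window functions) stands in for Oracle, and the contents of table1 and table2 are reconstructed from the outputs shown in this article. The helper name intersect_all is hypothetical. The Counter "&" operator keeps the minimum count of each element, which is what intersect all should return.

```python
import sqlite3
from collections import Counter

# Assumption: table contents reconstructed from the example outputs in this article.
TABLE1 = ["HELSINKI"] * 2 + ["RIGA"] * 3 + ["STOCKHOLM"] + ["TALLINN"] * 3 + ["VILNIUS"]
TABLE2 = ["HELSINKI"] + ["RIGA"] * 2 + ["VILNIUS"] * 4


def make_db():
    """Create an in-memory SQLite database holding table1 and table2."""
    conn = sqlite3.connect(":memory:")
    for name, cities in (("table1", TABLE1), ("table2", TABLE2)):
        conn.execute(f"CREATE TABLE {name} (city TEXT)")
        conn.executemany(f"INSERT INTO {name} VALUES (?)", [(c,) for c in cities])
    return conn


def intersect_all(conn, a, b):
    """Emulate 'a INTERSECT ALL b' by numbering duplicates before INTERSECT."""
    numbered = "SELECT city, row_number() OVER (PARTITION BY city ORDER BY city) rn FROM {}"
    sql = (
        f"SELECT city FROM ({numbered.format(a)} INTERSECT {numbered.format(b)}) q "
        "ORDER BY city"
    )
    return [row[0] for row in conn.execute(sql)]


conn = make_db()
forward = intersect_all(conn, "table1", "table2")
backward = intersect_all(conn, "table2", "table1")
# Multiset intersection: each city is kept min(count1, count2) times.
expected = sorted((Counter(TABLE1) & Counter(TABLE2)).elements())
print(forward == expected, forward == backward, forward)
```

Swapping the two table names gives the same rows, which illustrates that the emulated intersect all stays commutative.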
2.6 Raising it to higher levels - are two table data equal?
There are times when we need to find out whether the data in two tables are equal. And here I mean
"really equal", i.e. both tables contain the same rows and, in case of duplicate rows, their
cardinality is also the same. So what we need is to test whether the "opposite" of intersect,
i.e. the rows that are returned only by the first query or only by the second query, is an empty
set. Visually, the grey area in the following picture would be empty.
[Picture: rows returned only by Query1 or only by Query2, in grey]

In set theory "the opposite" of intersect is called the symmetric difference, which is
similar to XOR (exclusive OR) in Boolean logic.
Unfortunately, SQL has no symmetric difference operator, so we need to be more
creative. Looking at the previous pictures throughout this article, it is quite obvious what we
need:
(Query1 MINUS Query2)
UNION
(Query2 MINUS Query1)
If all rows are unique, this is sufficient: as soon as it returns at least one
row, the tables' data are not equal. But we have to remember that there might be duplicates,
and their number might differ between the two result sets. So then we'd need:
(Query1 MINUS ALL Query2)
UNION ALL
(Query2 MINUS ALL Query1)
Let's look at real examples.
Example 20 Distinct Symmetric difference of table1 and table2.
(SELECT city FROM table1
MINUS
SELECT city FROM table2)
UNION
(SELECT city FROM table2
MINUS
SELECT city FROM table1);

CITY
----------
STOCKHOLM
TALLINN
Obviously these tables differ somehow. But exactly how? For that we need a
smarter query.
Example 21 Symmetric difference retaining cardinality of table1 and
table2.
SELECT city FROM (
  SELECT city,
         row_number() OVER (PARTITION BY city ORDER BY city) rn
  FROM table1
  MINUS
  SELECT city,
         row_number() OVER (PARTITION BY city ORDER BY city) rn
  FROM table2
) q
UNION ALL
SELECT city FROM (
  SELECT city,
         row_number() OVER (PARTITION BY city ORDER BY city) rn
  FROM table2
  MINUS
  SELECT city,
         row_number() OVER (PARTITION BY city ORDER BY city) rn
  FROM table1
) q;

CITY
----------
HELSINKI
RIGA
STOCKHOLM
TALLINN
TALLINN
TALLINN
VILNIUS
VILNIUS
VILNIUS
So these are the rows that have no match in the other table. What if we'd like to know
which table each row is missing from? Just add a flag column holding the number of the table
that misses the row.
Example 22 Symmetric difference retaining cardinality and showing which
table (table1 or table2) misses each row.
SELECT city, 2 flag FROM (
  SELECT city,
         row_number() OVER (PARTITION BY city ORDER BY city) rn
  FROM table1
  MINUS
  SELECT city,
         row_number() OVER (PARTITION BY city ORDER BY city) rn
  FROM table2
) q
UNION ALL
SELECT city, 1 flag FROM (
  SELECT city,
         row_number() OVER (PARTITION BY city ORDER BY city) rn
  FROM table2
  MINUS
  SELECT city,
         row_number() OVER (PARTITION BY city ORDER BY city) rn
  FROM table1
) q;

CITY FLAG
---------- ----------
HELSINKI 2
RIGA 2
STOCKHOLM 2
TALLINN 2
TALLINN 2
TALLINN 2
VILNIUS 1
VILNIUS 1
VILNIUS 1
So we can see that table2 misses one Helsinki, one Riga, one Stockholm and three Tallinn rows,
and table1 misses three Vilnius rows. If we added these rows, both tables would contain exactly
the same cities with exactly the same cardinality.
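The question of this section, whether two tables are "really equal", can be reduced to a single yes/no test: the symmetric difference retaining cardinality must be empty. The sketch below is an illustration only, with these assumptions: it runs on SQLite (3.25 or later) through Python's built-in sqlite3 module, uses EXCEPT in place of MINUS, and takes the contents of table1 and table2 from the example outputs in this article. The helper names diff_rows, distinct_diff and tables_equal, the tables table1_copy, dup and single, and the missing_from column (a table name instead of the numeric flag of Example 22) are hypothetical.

```python
import sqlite3

# Assumption: table contents reconstructed from the example outputs in this article.
TABLE1 = ["HELSINKI"] * 2 + ["RIGA"] * 3 + ["STOCKHOLM"] + ["TALLINN"] * 3 + ["VILNIUS"]
TABLE2 = ["HELSINKI"] + ["RIGA"] * 2 + ["VILNIUS"] * 4

NUMBERED = "SELECT city, row_number() OVER (PARTITION BY city ORDER BY city) rn FROM {}"


def make_db(tables):
    """Create an in-memory database from a {table_name: [cities]} mapping."""
    conn = sqlite3.connect(":memory:")
    for name, cities in tables.items():
        conn.execute(f"CREATE TABLE {name} (city TEXT)")
        conn.executemany(f"INSERT INTO {name} VALUES (?)", [(c,) for c in cities])
    return conn


def diff_rows(conn, a, b):
    """Symmetric difference retaining cardinality; each row names the table missing it."""
    sql = f"""
        SELECT city, '{b}' AS missing_from
        FROM ({NUMBERED.format(a)} EXCEPT {NUMBERED.format(b)}) q1
        UNION ALL
        SELECT city, '{a}' AS missing_from
        FROM ({NUMBERED.format(b)} EXCEPT {NUMBERED.format(a)}) q2
        ORDER BY city, missing_from
    """
    return conn.execute(sql).fetchall()


def distinct_diff(conn, a, b):
    """Distinct symmetric difference, which ignores how often a row occurs."""
    sql = f"""
        SELECT city FROM (SELECT city FROM {a} EXCEPT SELECT city FROM {b})
        UNION
        SELECT city FROM (SELECT city FROM {b} EXCEPT SELECT city FROM {a})
        ORDER BY city
    """
    return [row[0] for row in conn.execute(sql)]


def tables_equal(conn, a, b):
    """Two tables are 'really equal' when no row is left over on either side."""
    return not diff_rows(conn, a, b)


conn = make_db({
    "table1": TABLE1,
    "table2": TABLE2,
    "table1_copy": list(reversed(TABLE1)),  # same rows, different insert order
    "dup": ["RIGA", "RIGA"],
    "single": ["RIGA"],
})
print(tables_equal(conn, "table1", "table1_copy"))
print(tables_equal(conn, "table1", "table2"))
# The distinct variant sees no difference between dup and single,
# while the cardinality-aware variant reports the extra RIGA row.
print(distinct_diff(conn, "dup", "single"), diff_rows(conn, "dup", "single"))
```

The dup and single tables show why the distinct form alone is not a safe equality test: it reports nothing although one table holds a row twice and the other only once.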
3 Example usage for various DBMSes
3.1 Oracle
All examples were created and tested on Oracle 10g. 11g did not introduce new features for
set operators. Analytic functions were introduced in 8i and the subquery factoring clause in 9i,
so examples containing these won't work in earlier versions. Examples with pure set
operators should work even on more prehistoric :) versions than 8i.
3.2 Microsoft SQL Server
Examples were tested on SQL Server 2008. SQL Server has no predefined dual table, but it
is possible to select one row without a From clause. Therefore Example 8 and Example 9 work
if FROM dual is deleted. SQL Server has the Except operator, therefore Example 10
through Example 16, Example 20, Example 21 and Example 22 work if minus is
replaced with except.
3.3 MySQL
Examples were tested on MySQL 6.0. The tested MySQL version doesn't support the query factoring
clause (common table expression), therefore Example 8 and Example 9 do not work. However, the
fact about Null values shown in Example 9 is also true in MySQL. Examples starting
from Example 10 do not work because the tested version supports neither the Except nor the
Intersect operator.
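Where except and intersect are missing, the distinct variants can usually be rewritten with correlated subqueries. The sketch below is an assumption-laden illustration, not something tested on MySQL: it runs on SQLite through Python's built-in sqlite3 module, uses table contents reconstructed from the example outputs in this article, and assumes city is never Null (set operators treat Nulls as equal, while the = comparison does not). The constant and helper names are hypothetical. SQLite does have native EXCEPT and INTERSECT, so they are used here only to compare results.

```python
import sqlite3

# Assumption: table contents reconstructed from the example outputs in this article.
TABLE1 = ["HELSINKI"] * 2 + ["RIGA"] * 3 + ["STOCKHOLM"] + ["TALLINN"] * 3 + ["VILNIUS"]
TABLE2 = ["HELSINKI"] + ["RIGA"] * 2 + ["VILNIUS"] * 4

# table1 except [distinct] table2, written without a set operator.
EXCEPT_VIA_NOT_EXISTS = """
    SELECT DISTINCT t1.city FROM table1 t1
    WHERE NOT EXISTS (SELECT 1 FROM table2 t2 WHERE t2.city = t1.city)
    ORDER BY t1.city
"""

# table1 intersect [distinct] table2, written without a set operator.
INTERSECT_VIA_EXISTS = """
    SELECT DISTINCT t1.city FROM table1 t1
    WHERE EXISTS (SELECT 1 FROM table2 t2 WHERE t2.city = t1.city)
    ORDER BY t1.city
"""

NATIVE_EXCEPT = "SELECT city FROM table1 EXCEPT SELECT city FROM table2 ORDER BY city"
NATIVE_INTERSECT = "SELECT city FROM table1 INTERSECT SELECT city FROM table2 ORDER BY city"


def make_db():
    """Create an in-memory SQLite database holding table1 and table2."""
    conn = sqlite3.connect(":memory:")
    for name, rows in (("table1", TABLE1), ("table2", TABLE2)):
        conn.execute(f"CREATE TABLE {name} (city TEXT)")
        conn.executemany(f"INSERT INTO {name} VALUES (?)", [(c,) for c in rows])
    return conn


def cities(conn, sql):
    """Run a one-column query and return the values as a list."""
    return [row[0] for row in conn.execute(sql)]


conn = make_db()
# Each rewritten query should return the same rows as the native operator.
print(cities(conn, EXCEPT_VIA_NOT_EXISTS) == cities(conn, NATIVE_EXCEPT))
print(cities(conn, INTERSECT_VIA_EXISTS) == cities(conn, NATIVE_INTERSECT))
```

This covers only the distinct variants; the all variants need a duplicate counter such as row_number(), as in the faked examples above.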
3.4 IBM DB2
According to the DB2 documentation, it supports all set operators as well as common
table expressions, so all examples should work (minus must be changed to except).
Unfortunately, I don't have DB2 installed to verify this.
4 References and more information
[1] Mastering Oracle SQL by Sanjay Mishra, Alan Beaulieu, Chapter 7 Set Operations;
[2] Oracle Database SQL Language Reference 11g Release 1 The UNION [ALL], INTERSECT,
MINUS Operators;
[3] DB2 version 9.1. Combining result tables from multiple SELECT statements;
[4] SQL Server 2008 Books Online (May 2008) EXCEPT and INTERSECT (Transact-SQL);
[5] SQL Server 2008 Books Online (May 2008) UNION (Transact-SQL);
[6] UNION (ALL) Syntax, MySQL 6.0 Reference Manual;
[7] MINUS ALL and INTERSECT ALL in Oracle Revisited;
[8] SQL join types.
