There are two leading approaches to storing data in a data warehouse - the dimensional approach and
the normalized approach.
In the dimensional approach, transaction data are partitioned into "facts", which are generally
numeric transaction data, and "dimensions", which are the reference information that gives context to
the facts. For example, a sales transaction can be broken up into facts such as the number of products
ordered and the price paid for the products, and into dimensions such as order date, customer name,
product number, order ship-to and bill-to locations, and salesperson responsible for receiving the order.
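The fact/dimension split described above can be made concrete with a short sketch. The field names below are illustrative only, not taken from any particular system:

```python
# Illustrative only: partition a sales transaction into numeric facts
# and descriptive dimensions, mirroring the example in the text.
transaction = {
    "units_ordered": 3,           # fact: numeric measure
    "price_paid": 44.97,          # fact: numeric measure
    "order_date": "1997-06-01",   # dimension: context
    "customer_name": "A. Smith",  # dimension: context
    "product_number": "TV-100",   # dimension: context
    "salesperson": "B. Jones",    # dimension: context
}

FACT_FIELDS = {"units_ordered", "price_paid"}

facts = {k: v for k, v in transaction.items() if k in FACT_FIELDS}
dimensions = {k: v for k, v in transaction.items() if k not in FACT_FIELDS}

print(facts)       # the measurable, additive values
print(dimensions)  # the reference data that gives the facts context
```

In a real warehouse the dimensions would become foreign keys into dimension tables rather than inline text, but the partitioning principle is the same.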
A key advantage of a dimensional approach is that the data warehouse is easier for the user to
understand and to use. Also, the retrieval of data from the data warehouse tends to operate very
quickly. The main disadvantages of the dimensional approach are:
1) In order to maintain the integrity of facts and dimensions, loading the data warehouse with data
from different operational systems is complicated, and
2) It is difficult to modify the data warehouse structure if the organization adopting the dimensional
approach changes the way in which it does business.
In the normalized approach, the data in the data warehouse are stored following, to a degree, Codd's
normalization rules. Tables are grouped together by subject areas that reflect general data categories
(e.g., data on customers, products, finance, etc.) The main advantage of this approach is that it is
straightforward to add information into the database. A disadvantage of this approach is that, because
of the number of tables involved, it can be difficult for users both to
1) join data from different sources into meaningful information and then
2) access the information without a precise understanding of the sources of data and of the data
structure of the data warehouse.
These approaches are not exact opposites of each other. Dimensional approaches can involve
normalizing data to a degree.
Organizations generally start off with relatively simple use of data warehousing. Over time, more
sophisticated use of data warehousing evolves. The following general stages of use of the data
warehouse can be distinguished:
Offline Operational Databases
Data warehouses in this initial stage are developed by simply copying the data of an operational system
to another server where the processing load of reporting against the copied data does not impact the
operational system's performance.
Offline Data Warehouse
Data warehouses at this stage are updated from data in the operational systems on a regular basis and
the data warehouse data is stored in a data structure designed to facilitate reporting.
Real Time Data Warehouse
Data warehouses at this stage are updated every time an operational system performs a transaction
(e.g., an order, a delivery, or a booking).
Integrated Data Warehouse
Data warehouses at this stage are updated every time an operational system performs a transaction.
The data warehouses then generate transactions that are passed back into the operational systems.
Fact table:
In data warehousing, a fact table consists of the measurements, metrics or facts of a business process.
It is often located at the centre of a star schema, surrounded by dimension tables. Fact tables provide
the (usually) additive values which act as independent variables by which dimensional attributes are
analyzed. Fact tables are often defined by their grain. The grain of a fact table represents the most
atomic level by which the facts may be defined. The grain of a SALES fact table might be stated as
"Sales volume by Day by Product by Store". Each record in this fact table is therefore uniquely defined
by a day, product and store. Other dimensions might be members of this fact table (such as
location/region) but these add nothing to the uniqueness of the fact records. These "affiliate
dimensions" allow for additional slices of the independent facts but generally provide insights at a
higher level of aggregation (region is made up of many stores).
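The grain "Sales volume by Day by Product by Store" translates directly into a three-column composite primary key. A minimal SQLite sketch (table and column names follow the SALES example in the text; the data is invented):

```python
import sqlite3

# One row per (day, product, store): the composite primary key
# enforces the grain of the fact table.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE sales_fact (
        day_id     INTEGER,
        product_id INTEGER,
        store_id   INTEGER,
        units_sold INTEGER,
        PRIMARY KEY (day_id, product_id, store_id)
    )
""")
con.execute("INSERT INTO sales_fact VALUES (1, 10, 100, 5)")

# A second row at the same grain violates the key, showing that each
# fact record is uniquely defined by its day, product and store.
try:
    con.execute("INSERT INTO sales_fact VALUES (1, 10, 100, 7)")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False

print(duplicate_allowed)  # False
```

An "affiliate dimension" such as region would not appear in this key at all; it would hang off the store dimension instead.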
Dimension table:
A data warehouse dimension provides the means to "slice and dice" data in a data warehouse.
Dimensions provide structured labeling information to otherwise unordered numeric measures. For
example, "Customer", "Date", and "Product" are all dimensions that could be applied meaningfully to a
sales receipt. A dimensional data element is similar to a categorical variable in statistics.
The primary function of dimensions is threefold: to provide filtering, grouping and labeling. For
example, in a data warehouse where each person is categorized as having a gender of male, female or
unknown, a user of the data warehouse would then be able to filter or categorize each presentation or
report by either filtering based on the gender dimension or displaying results broken out by the gender.
Star Schema:
The star schema (sometimes referenced as star join schema) is the simplest style of data warehouse
schema. The star schema consists of a few "fact tables" (possibly only one, justifying the name)
referencing any number of "dimension tables". The star schema is considered an important special case
of the snowflake schema.
Example
[Figure: star schema used by the example query.]
Consider a database of sales, perhaps from a store chain, classified by date, store and product. The
schema in the figure is a star schema version of the sample schema provided in the snowflake schema
article.
Fact_Sales is the fact table and there are three dimension tables Dim_Date, Dim_Store and
Dim_Product.
Each dimension table has a primary key on its Id column, relating to one of the columns of the
Fact_Sales table's three-column primary key (Date_Id, Store_Id, Product_Id). The non-primary key
Units_Sold column of the fact table in this example represents a measure or metric that can be used in
calculations and analysis. The non-primary key columns of the dimension tables represent additional
attributes of the dimensions (such as the Year of the Dim_Date dimension).
A typical query against this schema extracts how many TV sets have been sold, for each brand and
country, in 1997.
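Such a query can be sketched by rebuilding the schema in SQLite. The table names Fact_Sales, Dim_Date, Dim_Store, Dim_Product and the Units_Sold column come from the text above; Year, Brand, Country and Product_Category are assumed attribute names, and the data is invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Dim_Date    (Id INTEGER PRIMARY KEY, Date TEXT, Year INTEGER);
    CREATE TABLE Dim_Store   (Id INTEGER PRIMARY KEY, Country TEXT);
    CREATE TABLE Dim_Product (Id INTEGER PRIMARY KEY, Brand TEXT,
                              Product_Category TEXT);
    CREATE TABLE Fact_Sales  (Date_Id INTEGER, Store_Id INTEGER,
                              Product_Id INTEGER, Units_Sold INTEGER,
                              PRIMARY KEY (Date_Id, Store_Id, Product_Id));

    INSERT INTO Dim_Date    VALUES (1, '1997-03-01', 1997), (2, '1998-03-01', 1998);
    INSERT INTO Dim_Store   VALUES (1, 'Germany'), (2, 'France');
    INSERT INTO Dim_Product VALUES (1, 'BrandA', 'tv'), (2, 'BrandB', 'radio');

    INSERT INTO Fact_Sales VALUES (1, 1, 1, 10);  -- 1997, Germany, tv
    INSERT INTO Fact_Sales VALUES (1, 2, 1,  5);  -- 1997, France,  tv
    INSERT INTO Fact_Sales VALUES (2, 1, 1, 99);  -- 1998: filtered out
    INSERT INTO Fact_Sales VALUES (1, 1, 2,  7);  -- radio: filtered out
""")

# TV sets sold per brand and country in 1997: join the fact table to
# each dimension, filter on dimension attributes, group by the rest.
rows = con.execute("""
    SELECT P.Brand, S.Country, SUM(F.Units_Sold)
    FROM Fact_Sales F
    JOIN Dim_Date    D ON F.Date_Id    = D.Id
    JOIN Dim_Store   S ON F.Store_Id   = S.Id
    JOIN Dim_Product P ON F.Product_Id = P.Id
    WHERE D.Year = 1997 AND P.Product_Category = 'tv'
    GROUP BY P.Brand, S.Country
    ORDER BY S.Country
""").fetchall()

print(rows)  # [('BrandA', 'France', 5), ('BrandA', 'Germany', 10)]
```

Note how every filter and every grouping column comes from a dimension table, while the fact table contributes only the summed measure.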
Normalization:
Database normalization, sometimes referred to as canonical synthesis, is a technique for designing
relational database tables to minimize duplication of information and, in so doing, to safeguard the
database against certain types of logical or structural problems, namely data anomalies. For example,
when multiple instances of a given piece of information occur in a table, the possibility exists that these
instances will not be kept consistent when the data within the table is updated, leading to a loss of data
integrity. A table that is sufficiently normalized is less vulnerable to problems of this kind, because its
structure reflects the basic assumptions for when multiple instances of the same information should be
represented by a single instance only.
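A toy illustration of the update anomaly described above (the data is invented): a customer's city stored on every order row can drift out of sync, whereas a normalized split stores it exactly once.

```python
# Unnormalized: the customer's city is repeated on every order row.
orders_flat = [
    {"order_id": 1, "customer": "Acme", "city": "Oslo"},
    {"order_id": 2, "customer": "Acme", "city": "Oslo"},
]

# An update that touches only one row leaves the table inconsistent:
orders_flat[0]["city"] = "Bergen"
cities = {row["city"] for row in orders_flat if row["customer"] == "Acme"}
print(len(cities))  # 2 distinct cities for one customer: the anomaly

# Normalized: the city exists in a single place, so it cannot diverge.
customers = {"Acme": {"city": "Oslo"}}
orders = [{"order_id": 1, "customer": "Acme"},
          {"order_id": 2, "customer": "Acme"}]
customers["Acme"]["city"] = "Bergen"   # one instance, one update
print(customers["Acme"]["city"])
```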
Higher degrees of normalization typically involve more tables and create the need for a larger number
of joins, which can reduce performance. Accordingly, more highly normalized tables are typically used
in database applications involving many isolated transactions (e.g. an automated teller machine), while
less normalized tables tend to be used in database applications that need to map complex relationships
between data entities and data attributes (e.g. a reporting application, or a full-text search application).
Database theory describes a table's degree of normalization in terms of normal forms of successively
higher degrees of strictness. A table in third normal form (3NF), for example, is consequently in second
normal form (2NF) as well; but the reverse is not necessarily the case.
Although the normal forms are often defined informally in terms of the characteristics of tables,
rigorous definitions of the normal forms are concerned with the characteristics of mathematical
constructs known as relations. Whenever information is represented relationally, it is meaningful to
consider the extent to which the representation is normalized.
Materialized view:
In a database management system following the relational model, a view is a virtual table representing
the result of a database query. Whenever an ordinary view's table is queried or updated, the DBMS
converts these into queries or updates against the underlying base tables. A materialized view takes a
different approach in which the query result is cached as a concrete table that may be updated from the
original base tables from time to time. This enables much more efficient access, at the cost of some
data being potentially out-of-date. It is most useful in data warehousing scenarios, where frequent
queries of the actual base tables can be extremely expensive.
In addition, because the view is manifested as a real table, anything that can be done to a real table
can be done to it, most importantly building indexes on any column, enabling drastic speedups in query
time. In a normal view, it's typically only possible to exploit indexes on columns that come directly from
(or have a mapping to) indexed columns in the base tables; often this functionality is not offered at all.
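The caching behaviour can be emulated in SQLite, used here purely as an illustration since it has no native materialized views: the query result is stored as a real table (which can be indexed) and refreshed on demand.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (product TEXT, amount INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("tv", 10), ("tv", 5), ("radio", 7)])

def refresh_totals():
    # Re-materialize the cached result; because it is a real table,
    # an index can be built on it, unlike an ordinary view.
    con.executescript("""
        DROP TABLE IF EXISTS sales_totals;
        CREATE TABLE sales_totals AS
            SELECT product, SUM(amount) AS total FROM sales GROUP BY product;
        CREATE INDEX idx_totals_product ON sales_totals (product);
    """)

refresh_totals()
con.execute("INSERT INTO sales VALUES ('tv', 100)")

# The cached copy is stale until the next refresh: exactly the
# trade-off described above (fast access, potentially out-of-date).
stale = con.execute(
    "SELECT total FROM sales_totals WHERE product = 'tv'").fetchone()[0]
refresh_totals()
fresh = con.execute(
    "SELECT total FROM sales_totals WHERE product = 'tv'").fetchone()[0]
print(stale, fresh)  # 15 115
```

Production systems (e.g., SQL Server indexed views, Oracle materialized views) maintain the cached table automatically or incrementally rather than rebuilding it wholesale as this sketch does.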
A relational database is designed for a specific purpose. Because the purpose of a data warehouse
differs from that of an OLTP, the design characteristics of a relational database that supports a data
warehouse differ from the design characteristics of an OLTP database.
Data warehouse database: optimized for bulk loads and for large, complex, unpredictable queries that
access many rows per table; loaded with consistent, valid data, so it requires no real-time validation.
OLTP database: optimized for a common set of transactions, usually adding or retrieving a single row
at a time per table; optimized for validation of incoming data during transactions, and uses validation
data tables.
A data warehouse supports an OLTP system by providing a place for the OLTP database to offload data
as it accumulates, and by providing services that would complicate and degrade OLTP operations if they
were performed in the OLTP database.
Without a data warehouse to hold historical information, data is archived to static media such as
magnetic tape, or allowed to accumulate in the OLTP database.
If data is simply archived for preservation, it is not available or organized for use by analysts and
decision makers. If data is allowed to accumulate in the OLTP so it can be used for analysis, the OLTP
database continues to grow in size and requires more indexes to service analytical and report queries.
These queries access and process large portions of the continually growing historical data and add a
substantial load to the database. The large indexes needed to support these queries also tax the OLTP
transactions with additional index maintenance. These queries can also be complicated to develop due
to the typically complex OLTP database schema.
A data warehouse offloads the historical data from the OLTP, allowing the OLTP to operate at peak
transaction efficiency. High volume analytical and reporting queries are handled by the data warehouse
and do not load the OLTP, which does not need additional indexes for their support. As data is moved to
the data warehouse, it is also reorganized and consolidated so that analytical queries are simpler and
more efficient.
Online analytical processing (OLAP) is a technology designed to provide superior performance for ad
hoc business intelligence queries. OLAP is designed to operate efficiently with data organized in
accordance with the common dimensional model used in data warehouses.
A data warehouse provides a multidimensional view of data in an intuitive model designed to match the
types of queries posed by analysts and decision makers. OLAP organizes data warehouse data into
multidimensional cubes based on this dimensional model, and then preprocesses these cubes to provide
maximum performance for queries that summarize data in various ways. For example, a query that
requests the total sales income and quantity sold for a range of products in a specific geographical
region for a specific time period can typically be answered in a few seconds or less regardless of how
many hundreds of millions of rows of data are stored in the data warehouse database.
OLAP is not designed to store large volumes of text or binary data, nor is it designed to support high
volume update transactions. The inherent stability and consistency of historical data in a data
warehouse enables OLAP to provide its remarkable performance in rapidly summarizing information for
analytical queries.
In SQL Server 2000, Analysis Services provides tools for developing OLAP applications and a server
specifically designed to service OLAP queries.
Data mining is a technology that applies sophisticated and complex algorithms to analyze data and
expose interesting information for analysis by decision makers. Whereas OLAP organizes data in a
model suited for exploration by analysts, data mining performs analysis on data and provides the
results to decision makers. Thus, OLAP supports model-driven analysis and data mining supports data-
driven analysis.
Data mining has traditionally operated only on raw data in the data warehouse database or, more
commonly, text files of data extracted from the data warehouse database. In SQL Server 2000, Analysis
Services provides data mining technology that can analyze data in OLAP cubes, as well as data in the
relational data warehouse database. In addition, data mining results can be incorporated into OLAP
cubes to further enhance model-driven analysis by providing an additional dimensional viewpoint into
the OLAP model. For example, data mining can be used to analyze sales data against customer
attributes and create a new cube dimension to assist the analyst in the discovery of the information
embedded in the cube data.
For more information and details about data mining in SQL Server 2000, see Chapter 24, "Effective
Strategies for Data Mining," in the SQL Server 2000 Resource Kit.
Before embarking on the design of a data warehouse, it is imperative that the architectural goals of the
data warehouse be clear and well understood. Because the purpose of a data warehouse is to serve
users, it is also critical to understand the various types of users, their needs, and the characteristics of
their interactions with the data warehouse.
A data warehouse exists to serve its users (analysts and decision makers) and must be designed to
satisfy their requirements; among other things, it must provide a variety of powerful analytical tools,
such as OLAP and data mining. Most successful data warehouses that meet these requirements share
several common characteristics:
Data warehouses are often quite large. However, size is not an architectural goal—it is a characteristic
driven by the amount of data needed to serve the users.
The success of a data warehouse is measured solely by its acceptance by users. Without users,
historical data might as well be archived to magnetic tape and stored in the basement. Successful data
warehouse design starts with understanding the users and their needs.
Data warehouse users can be divided into four categories: Statisticians, Knowledge Workers,
Information Consumers, and Executives. Each type makes up a portion of the user population as
illustrated in this diagram.
Statisticians: There are typically only a handful of sophisticated analysts—Statisticians and operations
research types—in any organization. Though few in number, they are some of the best users of the
data warehouse; those whose work can contribute to closed loop systems that deeply influence the
operations and profitability of the company. It is vital that these users come to love the data
warehouse. Usually that is not difficult; these people are often very self-sufficient and need only to be
pointed to the database and given some simple instructions about how to get to the data and what
times of the day are best for performing large queries to retrieve data to analyze using their own
sophisticated tools. They can take it from there.
Knowledge Workers: A relatively small number of analysts perform the bulk of new queries and
analyses against the data warehouse. These are the users who get the "Designer" or "Analyst" versions
of user access tools. They will figure out how to quantify a subject area. After a few iterations, their
queries and reports typically get published for the benefit of the Information Consumers. Knowledge
Workers are often deeply engaged with the data warehouse design and place the greatest demands on
the ongoing data warehouse operations team for training and support.
Information Consumers: Most users of the data warehouse are Information Consumers; they will
probably never compose a true ad hoc query. They use static or simple interactive reports that others
have developed. It is easy to forget about these users, because they usually interact with the data
warehouse only through the work product of others. Do not neglect these users! This group includes a
large number of people, and published reports are highly visible. Set up a great communication
infrastructure for distributing information widely, and gather feedback from these users to improve the
information sites over time.
Executives: Executives are a special case of the Information Consumers group. Few executives
actually issue their own queries, but an executive's slightest musing can generate a flurry of activity
among the other types of users. A wise data warehouse designer/implementer/owner will develop a
very cool digital dashboard for executives, assuming it is easy and economical to do so. Usually this
should follow other data warehouse work, but it never hurts to impress the bosses.
Information for users can be extracted from the data warehouse relational database or from the output
of analytical services such as OLAP or data mining. Direct queries to the data warehouse relational
database should be limited to those that cannot be accomplished through existing tools, which are often
more efficient than direct queries and impose less load on the relational database.
Reporting tools and custom applications often access the database directly. Statisticians frequently
extract data for use by special analytical tools. Analysts may write complex queries to extract and
compile specific information not readily accessible through existing tools. Information consumers do not
interact directly with the relational database but may receive e-mail reports or access web pages that
expose data from the relational database. Executives use standard reports or ask others to create
specialized reports for them.
When using the Analysis Services tools in SQL Server 2000, Statisticians will often perform data mining,
Analysts will write MDX queries against OLAP cubes and use data mining, and Information Consumers
will use interactive reports designed by others.
The phases of a data warehouse project listed below are similar to those of most database projects,
starting with identifying requirements and ending with deploying the system:
Identify sponsors. A successful data warehouse project needs a sponsor in the business organization
and usually a second sponsor in the Information Technology group. Sponsors must understand and
support the business value of the project.
Understand the business before entering into discussions with users. Then interview and work with the
users, not the data—learn the needs of the users and turn these needs into project requirements. Find
out what information they need to be more successful at their jobs, not what data they think should be
in the data warehouse; it is the data warehouse designer's job to determine what data is necessary to
provide the information. Topics for discussion are the users' objectives and challenges and how they go
about making business decisions. Business users should be closely tied to the design team during the
logical design process; they are the people who understand the meaning of existing data. Many
successful projects include several business users on the design team to act as data experts and
"sounding boards" for design concepts. Whatever the structure of the team, it is important that
business users feel ownership for the resulting system.
Interview data experts after interviewing several users. Find out from the experts what data exists and
where it resides, but only after you understand the basic business needs of the end users. Information
about available data is needed early in the process, before you complete the analysis of the business
needs, but the physical design of existing data should not be allowed to have much influence on
discussions about business needs.
User requirements and data realities drive the design of the dimensional model, which must address
business needs, grain of detail, and what dimensions and facts to include.
The dimensional model must suit the requirements of the users and support ease of use for direct
access. The model must also be designed so that it is easy to maintain and can adapt to future
changes. The model design must result in a relational database that supports OLAP cubes to provide
"instantaneous" query results for analysts.
An OLTP system requires a normalized structure to minimize redundancy, provide validation of input
data, and support a high volume of fast transactions. A transaction usually involves a single business
event, such as placing an order or posting an invoice payment. An OLTP model often looks like a spider
web of hundreds or even thousands of related tables.
In contrast, a typical dimensional model uses a star or snowflake design that is easy to understand and
relate to business needs, supports simplified business queries, and provides superior query
performance by minimizing table joins.
For example, contrast the very simplified OLTP data model in the first diagram below with the data
warehouse dimensional model in the second diagram. Which one better supports the ease of developing
reports and simple, efficient summarization queries?
The principal characteristic of a dimensional model is a set of detailed business facts surrounded by
multiple dimensions that describe those facts. When realized in a database, the schema for a
dimensional model contains a central fact table and multiple dimension tables. A dimensional model
may produce a star schema or a snowflake schema.
Star Schemas
A schema is called a star schema if all dimension tables can be joined directly to the fact table. The
following diagram shows a classic star schema.
Snowflake Schemas
A schema is called a snowflake schema if one or more dimension tables do not join directly to the fact
table but must join through other dimension tables. For example, a dimension that describes products
may be separated into three tables (snowflaked) as illustrated in the following diagram.
A snowflake schema with multiple heavily snowflaked dimensions is illustrated in the following diagram.
Star or Snowflake
Both star and snowflake schemas are dimensional models; the difference is in their physical
implementations. Snowflake schemas support ease of dimension maintenance because they are more
normalized. Star schemas are easier for direct user access and often support simpler and more efficient
queries. The decision to model a dimension as a star or snowflake depends on the nature of the
dimension itself, such as how frequently it changes and which of its elements change, and often
involves evaluating tradeoffs between ease of use and ease of maintenance. It is often easiest to
maintain a complex dimension by snowflaking it. By pulling hierarchical levels into
separate tables, referential integrity between the levels of the hierarchy is guaranteed. Analysis
Services reads from a snowflaked dimension as well as, or better than, from a star dimension.
However, it is important to present a simple and appealing user interface to business users who are
developing ad hoc queries on the dimensional database. It may be better to create a star version of the
snowflaked dimension for presentation to the users. Often, this is best accomplished by creating an
indexed view across the snowflaked dimension, collapsing it to a virtual star.
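The collapse of a snowflake into a virtual star can be sketched as follows. SQLite is used for brevity and supports only plain (non-indexed) views; the indexed views discussed here are a SQL Server feature. All names are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- A snowflaked product dimension: three levels in three tables,
    -- with referential integrity guaranteed between the levels.
    CREATE TABLE category    (cat_id  INTEGER PRIMARY KEY, cat_name TEXT);
    CREATE TABLE subcategory (sub_id  INTEGER PRIMARY KEY, sub_name TEXT,
                              cat_id  INTEGER REFERENCES category);
    CREATE TABLE product     (prod_id INTEGER PRIMARY KEY, prod_name TEXT,
                              sub_id  INTEGER REFERENCES subcategory);

    INSERT INTO category    VALUES (1, 'Hardware');
    INSERT INTO subcategory VALUES (1, 'Monitors', 1);
    INSERT INTO product     VALUES (1, 'FlatScreen 17', 1);

    -- The view presents the snowflake to users as one flat star
    -- dimension, hiding the joins between the hierarchy levels.
    CREATE VIEW dim_product AS
        SELECT p.prod_id, p.prod_name, s.sub_name, c.cat_name
        FROM product p
        JOIN subcategory s ON p.sub_id = s.sub_id
        JOIN category    c ON s.cat_id = c.cat_id;
""")
row = con.execute("SELECT * FROM dim_product").fetchone()
print(row)  # (1, 'FlatScreen 17', 'Monitors', 'Hardware')
```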
Dimension Tables
Dimension tables encapsulate the attributes associated with facts and separate these attributes into
logically distinct groupings, such as time, geography, products, customers, and so forth.
A dimension table may be used in multiple places if the data warehouse contains multiple fact tables or
contributes data to data marts. For example, a product dimension may be used with a sales fact table
and an inventory fact table in the data warehouse, and also in one or more departmental data marts. A
dimension such as customer, time, or product that is used in multiple schemas is called a conforming
dimension if all copies of the dimension are the same. Summarization data and reports will not
correspond if different schemas use different versions of a dimension table. Using conforming
dimensions is critical to successful data warehouse design.
User input and evaluation of existing business reports help define the dimensions to include in the data
warehouse. A user who wants to see data "by sales region" and "by product" has just identified two
dimensions (geography and product). Business reports that group sales by salesperson or sales by
customer identify two more dimensions (salesforce and customer). Almost every data warehouse
includes a time dimension.
In contrast to a fact table, dimension tables are usually small and change relatively slowly. Dimension
tables are seldom keyed to date.
The records in a dimension table establish one-to-many relationships with the fact table. For example,
there may be a number of sales to a single customer, or a number of sales of a single product. The
dimension table contains attributes associated with the dimension entry; these attributes are rich and
user-oriented textual details, such as product name or customer name and address. Attributes serve as
report labels and query constraints. Attributes that are coded in an OLTP database should be decoded
into descriptions. For example, product category may exist as a simple integer in the OLTP database,
but the dimension table should contain the actual text for the category. The code may also be carried in
the dimension table if needed for maintenance. This denormalization simplifies and improves the
efficiency of queries and simplifies user query tools. However, if a dimension attribute changes
frequently, maintenance may be easier if the attribute is assigned to its own table to create a snowflake
dimension.
It is often useful to have a pre-established "no such member" or "unknown member" record in each
dimension to which orphan fact records can be tied during the update process. Business needs and the
reliability of consistent source data will drive the decision as to whether such placeholder dimension
records are required.
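One common sketch of the placeholder technique follows; the key value -1 is a convention, not a requirement. During the load, fact rows whose source key has no matching dimension row are tied to the "unknown member" record instead of being dropped:

```python
# Dimension keyed by surrogate key; -1 is the pre-established
# "unknown member" placeholder record.
customer_dim = {-1: "Unknown customer", 1: "Acme", 2: "Globex"}

incoming_facts = [
    {"customer_key": 1,  "amount": 100},
    {"customer_key": 99, "amount": 50},   # orphan: no such customer
]

loaded = []
for fact in incoming_facts:
    key = fact["customer_key"]
    if key not in customer_dim:
        key = -1                  # tie the orphan to the placeholder
    loaded.append({"customer_key": key, "amount": fact["amount"]})

print([f["customer_key"] for f in loaded])  # [1, -1]
```

The orphan fact survives the load and its measure still contributes to totals; it simply rolls up under "Unknown customer" until the source data is corrected.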
Hierarchies
The data in a dimension is usually hierarchical in nature. Hierarchies are determined by the business
need to group and summarize data into usable information. For example, a time dimension often
contains the hierarchy elements: (all time), Year, Quarter, Month, Day; or (all time), Year, Quarter,
Week, Day. A dimension may contain multiple hierarchies—a time dimension often contains both
calendar and fiscal year hierarchies. Geography is seldom a dimension of its own; it is usually a
hierarchy that imposes a structure on sales points, customers, or other geographically distributed
dimensions. An example geography hierarchy for sales points is: (all), Country or Region, Sales-region,
State or Province, City, Store.
Note that each hierarchy example has an "(all)" entry such as (all time), (all stores), (all customers),
and so forth. This top-level entry is an artificial category used for grouping the first-level categories of a
dimension and permits summarization of fact data to a single number for a dimension. For example, if
the first level of a product hierarchy includes product line categories for hardware, software,
peripherals, and services, the question "What was the total amount for sales of all products last year?"
is equivalent to "What was the total amount for the combined sales of hardware, software, peripherals,
and services last year?" The concept of an "(all)" node at the top of each hierarchy helps reflect the way
users want to phrase their questions. OLAP tools depend on hierarchies to categorize data—Analysis
Services will create by default an "(all)" entry for a hierarchy used in a cube if none is specified.
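The equivalence of the two questions above can be checked directly; the product lines come from the text, the sales figures are invented:

```python
# Sales by first-level product category; the "(all)" node is simply
# the sum over every first-level member of the hierarchy.
sales_by_category = {
    "hardware": 120, "software": 300, "peripherals": 45, "services": 80,
}

# "Total sales of all products" ...
all_products = sum(sales_by_category.values())

# ... equals "combined sales of hardware, software, peripherals
# and services": the same question phrased two ways.
combined = (sales_by_category["hardware"] + sales_by_category["software"]
            + sales_by_category["peripherals"] + sales_by_category["services"])

print(all_products == combined)  # True
```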
Surrogate Keys
A critical part of data warehouse design is the creation and use of surrogate keys in dimension tables. A
surrogate key is the primary key for a dimension table and is independent of any keys provided by
source data systems. Surrogate keys are created and maintained in the data warehouse and should not
encode any information about the contents of records; automatically increasing integers make good
surrogate keys. The original key for each record is carried in the dimension table but is not used as the
primary key. Surrogate keys provide the means to maintain data warehouse information when
dimensions change. Special keys are used for date and time dimensions, but these keys differ from
surrogate keys used for other dimension tables.
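A sketch of the pattern in SQLite (all names invented): the surrogate key is an automatically increasing integer, while the source system's key is carried along as an ordinary attribute rather than as the primary key.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE dim_customer (
        customer_sk INTEGER PRIMARY KEY AUTOINCREMENT, -- surrogate key
        source_key  TEXT,  -- original key from the source system
        name        TEXT
    )
""")
# Two versions of the same source record can coexist (e.g., after the
# customer changes): the surrogate key, not the source key, identifies
# each dimension row, which is what lets dimensions change over time.
con.executemany(
    "INSERT INTO dim_customer (source_key, name) VALUES (?, ?)",
    [("CUST-0042", "Acme"), ("CUST-0042", "Acme Holdings")])

rows = con.execute(
    "SELECT customer_sk, source_key FROM dim_customer").fetchall()
print(rows)  # [(1, 'CUST-0042'), (2, 'CUST-0042')]
```

Note that the surrogate values encode nothing about the record contents, exactly as recommended above.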
Avoid using GUIDs (globally unique identifiers) as keys in the data warehouse database. GUIDs may be
used in data from distributed source systems, but they are difficult to use as table keys. GUIDs use a
significant amount of storage (16 bytes each), cannot be efficiently sorted, and are difficult for humans
to read. Indexes on GUID columns may be relatively slower than indexes on integer keys because
GUIDs are four times larger. The Transact-SQL NEWID function can be used to create GUIDs for a
column of uniqueidentifier data type, and the ROWGUIDCOL property can be set for such a column to
indicate that the GUID values in the column uniquely identify rows in the table, but uniqueness is not
enforced.
Because a uniqueidentifier data type cannot be sorted, the GUID cannot be used in a GROUP BY
statement, nor can the occurrences of the uniqueidentifier GUID be distinctly counted—both GROUP BY
and COUNT DISTINCT operations are very common in data warehouses. The uniqueidentifier GUID
cannot be used as a measure in an Analysis Services cube.
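To make the storage comparison concrete, Python's uuid module shows the sizes involved (illustrative only; the document itself discusses SQL Server's uniqueidentifier type):

```python
import uuid

# A GUID occupies 16 bytes of storage; a typical integer surrogate key
# occupies 4 or 8 bytes, which is what the "four times larger" comparison
# above refers to.
guid = uuid.uuid4()
print(len(guid.bytes))   # → 16  (bytes of storage per GUID)
print(len(str(guid)))    # → 36  (characters in the human-readable form)
```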
The IDENTITY property and IDENTITY function can be used to create identity columns in tables and to
manage series of generated numeric keys. IDENTITY functionality is more useful in surrogate key
management than uniqueidentifier GUIDs.
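As an illustrative sketch (not the book's Transact-SQL), SQLite's INTEGER PRIMARY KEY AUTOINCREMENT plays the role that the IDENTITY property plays in SQL Server: the warehouse assigns an increasing integer surrogate key, while the source system's key is carried as an ordinary column:

```python
import sqlite3

# Illustrative only: an integer surrogate key assigned by the warehouse,
# with the source system's original key kept as a non-key column.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_dim (
        customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        source_customer_id TEXT NOT NULL,                -- original key from source system
        customer_name TEXT
    )
""")
conn.execute("INSERT INTO customer_dim (source_customer_id, customer_name) "
             "VALUES ('CUST-0042', 'Acme Ltd.')")
conn.execute("INSERT INTO customer_dim (source_customer_id, customer_name) "
             "VALUES ('CUST-0007', 'Globex')")
rows = conn.execute("SELECT customer_key, source_customer_id "
                    "FROM customer_dim ORDER BY customer_key").fetchall()
print(rows)  # → [(1, 'CUST-0042'), (2, 'CUST-0007')]
```

The surrogate keys 1 and 2 encode nothing about the records and are assigned independently of the source keys, so the warehouse remains stable even if the source system changes its keys.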
Date and Time Dimensions
Each event in a data warehouse occurs at a specific date and time, and data is often summarized by a
specified time period for analysis. Although the date and time of a business fact is usually recorded in
the source data, special date and time dimensions provide more effective and efficient mechanisms for
time-oriented analysis than the raw event time stamp. Date and time dimensions are designed to meet
the needs of the data warehouse users and are created within the data warehouse.
A date dimension often contains two hierarchies: one for calendar year and another for fiscal year.
Time Granularity
A date dimension with one record per day will suffice if users do not need time granularity finer than a
single day. A date by day dimension table will contain 365 records per year (366 in leap years).
A separate time dimension table should be constructed if a fine time granularity, such as minute or
second, is needed. A time dimension table of one-minute granularity will contain 1,440 rows for a day,
and a table of seconds will contain 86,400 rows for a day. If exact event time is needed, it should be
stored in the fact table itself rather than modeled in a separate dimension.
When a separate time dimension is used, the fact table contains one foreign key for the date dimension
and another for the time dimension. Separate date and time dimensions simplify many filtering
operations. For example, summarizing data for a range of days requires joining only the date dimension
table to the fact table. Analyzing cyclical data by time period within a day requires joining just the time
dimension table. The date and time dimension tables can both be joined to the fact table when a
specific time range is needed.
For hourly time granularity, the hour breakdown can be incorporated into the date dimension or placed
in a separate dimension. Business needs influence this design decision. If the main use is to extract
contiguous chunks of time that cross day boundaries (for example 11/24/2000 10 p.m. to 11/25/2000
6 a.m.), then it is easier if the hour and day are in the same dimension. However, it is easier to analyze
cyclical and recurring daily events if they are in separate dimensions. Unless there is a clear reason to
combine date and hour in a single dimension, it is generally better to keep them in separate
dimensions.
It is often useful to maintain attribute columns in a date dimension to provide additional convenience or
business information that supports analysis. For example, one or more columns in the time-by-hour
dimension table can indicate peak periods in a daily cycle, such as meal times for a restaurant chain or
heavy usage hours for an Internet service provider. Peak period columns may be Boolean, but it is
better to "decode" the Boolean yes/no into a brief description, such as "peak"/"offpeak". In a report,
the decoded values will be easier for business users to read than multiple columns of "yes" and "no".
These are some possible attribute columns that may be used in a date table. Fiscal year versions are
the same, although values such as quarter numbers may differ.
Column name       Data type       Format/Example   Comment
day_date          smalldatetime
week_begin_date   smalldatetime
month_num         tinyint         1 to 12
quarter_num       tinyint         1 to 4
year              smallint
holiday_ind       bit
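The attribute columns above can be populated when the date dimension is generated. A minimal Python sketch (illustrative only; the holiday list and the week-begins-on-Monday convention here are assumptions, not part of the original text):

```python
from datetime import date, timedelta

# Hypothetical holiday list used to populate the holiday_ind column.
HOLIDAYS = {date(2000, 1, 1), date(2000, 12, 25)}

def date_dim_row(d):
    # week_begin_date: the Monday of the week containing d (one common convention)
    week_begin = d - timedelta(days=d.weekday())
    return {
        "day_date": d.isoformat(),
        "week_begin_date": week_begin.isoformat(),
        "month_num": d.month,                   # 1 to 12
        "quarter_num": (d.month - 1) // 3 + 1,  # 1 to 4
        "year": d.year,
        "holiday_ind": d in HOLIDAYS,
    }

row = date_dim_row(date(2000, 11, 24))
print(row["quarter_num"], row["week_begin_date"])  # → 4 2000-11-20
```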
I/O performance should always be a key consideration for data warehouse designers and
administrators. The typical workload in a data warehouse is especially I/O intensive, with operations
such as large data loads and index builds, creation of materialized views, and queries over large
volumes of data. The underlying I/O system for a data warehouse should be designed to meet these
heavy requirements.
In fact, one of the leading causes of performance issues in a data warehouse is poor I/O configuration.
Database administrators who have previously managed other systems will likely need to pay more
careful attention to the I/O configuration for a data warehouse than they may have previously done for
other environments.
This chapter provides the following five high-level guidelines for data-warehouse I/O configurations:
• Configure I/O for bandwidth, not capacity
• Stripe far and wide
• Use redundancy
• Test the I/O system before building the database
• Plan for growth
The I/O configuration used by a data warehouse will depend on the characteristics of the specific
storage and server capabilities, so the material in this chapter is only intended to provide guidelines for
designing and tuning an I/O system.
Configure I/O for Bandwidth, Not Capacity
Storage configurations for a data warehouse should be chosen based on the I/O bandwidth that they
can provide, and not necessarily on their overall storage capacity. Buying storage based solely on
capacity can be a mistake, especially for systems less than 500GB in total size.
The capacity of individual disk drives is growing faster than the I/O throughput rates provided by those
disks, leading to a situation in which a small number of disks can store a large volume of data, but
cannot provide the same I/O throughput as a larger number of small disks.
As an example, consider a 200GB data mart. Using 72GB drives, this data mart could be built with as
few as six drives in a fully-mirrored environment. However, six drives might not provide enough I/O
bandwidth to handle a medium number of concurrent users on a 4-CPU server. Thus, even though six
drives provide sufficient storage, a larger number of drives may be required to provide acceptable
performance for this system.
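The trade-off can be sketched with a small calculation. The per-drive throughput (30 MB/s) and the server's consumption rate (600 MB/s) below are illustrative assumptions, not vendor figures:

```python
import math

# Back-of-envelope sizing sketch: take whichever requirement (capacity or
# bandwidth) demands more drives.
def drives_needed(data_gb, drive_gb, required_mb_s, drive_mb_s, mirrored=True):
    for_capacity = math.ceil(data_gb / drive_gb) * (2 if mirrored else 1)
    for_bandwidth = math.ceil(required_mb_s / drive_mb_s)
    return max(for_capacity, for_bandwidth)

# 200GB data mart on 72GB drives, fully mirrored: capacity alone needs 6 drives...
print(drives_needed(200, 72, required_mb_s=0, drive_mb_s=30))    # → 6
# ...but if the server can consume 600 MB/s, bandwidth dictates many more.
print(drives_needed(200, 72, required_mb_s=600, drive_mb_s=30))  # → 20
```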
While it may not be practical to estimate the I/O bandwidth that will be required by a data warehouse
before a system is built, it is generally practical with the guidance of the hardware manufacturer to
estimate how much I/O bandwidth a given server can potentially utilize, and ensure that the selected
I/O configuration will be able to successfully feed the server. There are many variables in sizing the I/O
systems, but one basic rule of thumb is that your data warehouse system should have multiple disks for
each CPU (at least two disks for each CPU at a bare minimum) in order to achieve optimal performance.
Stripe Far and Wide
The guiding principle in configuring an I/O system for a data warehouse is to maximize I/O bandwidth
by having multiple disks and channels access each database object. You can do this by striping the
datafiles of the Oracle Database. A striped file is a file distributed across multiple disks. This striping can
be managed by software (such as a logical volume manager), or within the storage hardware. The goal
is to ensure that each tablespace is striped across a large number of disks (ideally, all of the disks) so
that any database object can be accessed with the highest possible I/O bandwidth.
Use Redundancy
Because data warehouses are often the largest database systems in a company, they have the most
disks and thus are also the most susceptible to the failure of a single disk. Therefore, disk redundancy
is a requirement for data warehouses to protect against a hardware failure. Like disk-striping,
redundancy can be achieved in many ways using software or hardware.
A key consideration is that occasionally a balance must be made between redundancy and performance.
For example, a storage system in a RAID-5 configuration may be less expensive than a RAID-0+1
configuration, but it may not perform as well, either. Redundancy is necessary for any data warehouse,
but the approach to redundancy may vary depending upon the performance and cost constraints of
each data warehouse.
Test the I/O System Before Building the Database
The most important time to examine and tune the I/O system is before the database is even created.
Once the database files are created, it is more difficult to reconfigure the files. Some logical volume
managers may support dynamic reconfiguration of files, while other storage configurations may require
that files be entirely rebuilt in order to reconfigure their I/O layout. In both cases, considerable system
resources must be devoted to this reconfiguration.
When creating a data warehouse on a new system, the I/O bandwidth should be tested before creating
all of the database datafiles to validate that the expected I/O levels are being achieved. On most
operating systems, this can be done with simple scripts to measure the performance of reading and
writing large test files.
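Such a script might look like the following Python sketch (illustrative; any equivalent OS-level tool serves the same purpose):

```python
import os
import tempfile
import time

# A minimal sketch of the "simple scripts" mentioned above: time a large
# sequential write and read. A real test should use files much larger than
# the operating system cache and should exercise several concurrent streams;
# the file size here is kept small so the sketch runs quickly.
def measure_throughput(path, size_mb=64, chunk_mb=1):
    chunk = b"\0" * (chunk_mb * 1024 * 1024)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(size_mb // chunk_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # force data to disk before stopping the clock
    write_mb_s = size_mb / max(time.perf_counter() - start, 1e-9)
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(chunk_mb * 1024 * 1024):
            pass
    read_mb_s = size_mb / max(time.perf_counter() - start, 1e-9)
    return write_mb_s, read_mb_s

with tempfile.TemporaryDirectory() as d:
    w, r = measure_throughput(os.path.join(d, "iotest.dat"), size_mb=16)
    print(f"write {w:.0f} MB/s, read {r:.0f} MB/s")
```

Note that a cached read can report unrealistically high numbers; genuine validation requires test files that exceed the cache.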
Plan for Growth
A data warehouse designer should plan for future growth of a data warehouse. There are many
approaches to handling the growth in a system, and the key consideration is to be able to grow the I/O
system without compromising I/O bandwidth. You should not, for example, add four disks to an
existing system of 20 disks, and grow the database by adding a new tablespace striped across only the
four new disks. A better solution would be to add new tablespaces striped across all 24 disks, and over
time also convert the existing tablespaces striped across 20 disks to be striped across all 24 disks.
Storage Management
Two features to consider for managing disks are Oracle Managed Files and Automatic Storage
Management. Without these features, a database administrator must manage the database files, which,
in a data warehouse, can be hundreds or even thousands of files. Oracle Managed Files simplifies the
administration of a database by providing functionality to automatically create and manage files, so the
database administrator no longer needs to manage each database file. Automatic Storage Management
provides additional functionality for managing not only files but also the disks. With Automatic Storage
Management, the database administrator would administer a small number of disk groups. Automatic
Storage Management handles the tasks of striping and providing disk redundancy, including rebalancing
the database files when new disks are added to the system.
Data parallelism:
Data parallelism (also known as loop-level parallelism) is a form of parallelization of computing across
multiple processors in parallel computing environments. Data parallelism focuses on distributing the
data across different parallel computing nodes. It contrasts with task parallelism, another form of
parallelism.
In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved
when each processor performs the same task on different pieces of distributed data. In some situations,
a single execution thread controls operations on all pieces of data. In others, different threads control
the operation, but they execute the same code.
For instance, if we are running code on a 2-processor system (CPUs A and B) in a parallel environment,
and we wish to do a task on some data D, it is possible to tell CPU A to do that task on one part of D
and CPU B on another part simultaneously, thereby reducing the runtime of the execution. The data can
be assigned using conditional statements. As a specific example, consider adding two matrices. In a
data parallel implementation, CPU A could add all elements from the top half of the matrices, while CPU
B could add all elements from the bottom half of the matrices. Since the two processors work in
parallel, the job of performing matrix addition would take one half the time of performing the same
operation in serial using one CPU alone.
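The matrix-addition example can be sketched as follows. Python threads are used only to show the decomposition; because of Python's global interpreter lock, a real CPU-bound speedup would require separate processes or another runtime, but the halving of the data is the point:

```python
from concurrent.futures import ThreadPoolExecutor

# Each worker performs the same task (element-wise addition) on its own
# slice of the data: worker A takes the top halves, worker B the bottom.
def add_rows(rows_x, rows_y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(rows_x, rows_y)]

def parallel_matrix_add(x, y):
    mid = len(x) // 2
    with ThreadPoolExecutor(max_workers=2) as pool:
        top = pool.submit(add_rows, x[:mid], y[:mid])      # "CPU A"
        bottom = pool.submit(add_rows, x[mid:], y[mid:])   # "CPU B"
        return top.result() + bottom.result()

x = [[1, 2], [3, 4], [5, 6], [7, 8]]
y = [[10, 20], [30, 40], [50, 60], [70, 80]]
print(parallel_matrix_add(x, y))  # → [[11, 22], [33, 44], [55, 66], [77, 88]]
```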
Data parallelism emphasizes the distributed (parallelized) nature of the data, as opposed to the
processing (task parallelism). Most real programs fall somewhere on a continuum between task
parallelism and data parallelism.
The chapter "Data Warehouse Design Considerations" discussed the use of dimensional modeling to design
databases for data warehousing. In contrast to the complex, highly normalized, entity-relationship
schemas of online transaction processing (OLTP) databases, data warehouse schemas are simple and
denormalized. Regardless of the specific design or technology used in a data warehouse, its
implementation must include mechanisms to migrate data into the data warehouse database. This
process of data migration is generally referred to as the extraction, transformation, and loading (ETL)
process.
Some data warehouse experts add an additional term—management—to ETL, expanding it to ETLM.
Others use the M to mean meta data. Both refer to the management of the data as it flows into the
data warehouse and is used in the data warehouse. The information used to manage data consists of
data about data, which is the definition of meta data.
The topics in this chapter describe the elements of the ETL process and provide examples of procedures
that address common ETL issues such as managing surrogate keys, slowly changing dimensions, and
meta data.
The code examples in this chapter are also available on the SQL Server 2000 Resource Kit CD-ROM, in
the file \Docs\ChapterCode\CH19Code.txt. For more information, see Chapter 39, "Tools, Samples,
eBooks, and More."
Introduction
During the ETL process, data is extracted from an OLTP database, transformed to match the data
warehouse schema, and loaded into the data warehouse database. Many data warehouses also
incorporate data from non-OLTP systems, such as text files, legacy systems, and spreadsheets; such
data also requires extraction, transformation, and loading.
In its simplest form, ETL is the process of copying data from one database to another. This simplicity is
rarely, if ever, found in data warehouse implementations; in reality, ETL is often a complex combination
of process and technology that consumes a significant portion of the data warehouse development
efforts and requires the skills of business analysts, database designers, and application developers.
When defining ETL for a data warehouse, it is important to think of ETL as a process, not a physical
implementation. ETL systems vary from data warehouse to data warehouse and even between
department data marts within a data warehouse. A monolithic application, regardless of whether it is
implemented in Transact-SQL or a traditional programming language, does not provide the flexibility for
change necessary in ETL systems. A mixture of tools and technologies should be used to develop
applications that each perform a specific ETL task.
The ETL process is not a one-time event; new data is added to a data warehouse periodically. Typical
periodicity may be monthly, weekly, daily, or even hourly, depending on the purpose of the data
warehouse and the type of business it serves. Because ETL is an integral, ongoing, and recurring part of
a data warehouse, ETL processes must be automated and operational procedures documented. ETL also
changes and evolves as the data warehouse evolves, so ETL processes must be designed for ease of
modification. A solid, well-designed, and documented ETL system is necessary for the success of a data
warehouse project.
Data warehouses evolve to improve their service to the business and to adapt to changes in business
processes and requirements. Business rules change as the business reacts to market influences—the
data warehouse must respond in order to maintain its value as a tool for decision makers. The ETL
implementation must adapt as the data warehouse evolves.
Microsoft® SQL Server™ 2000 provides significant enhancements to existing performance and
capabilities, and introduces new features that make ETL processes simpler to develop, deploy, and
maintain, and faster to execute.
Regardless of how they are implemented, all ETL systems have a common purpose: they move data
from one database to another. Generally, ETL systems move data from OLTP systems to a data
warehouse, but they can also be used to move data from one data warehouse to another. An ETL
system consists of four distinct functional elements:
• Extraction
• Transformation
• Loading
• Meta data
Extraction
The ETL extraction element is responsible for extracting data from the source system. During
extraction, data may be removed from the source system or a copy made and the original data retained
in the source system. It is common to move historical data that accumulates in an operational OLTP
system to a data
warehouse to maintain OLTP performance and efficiency. Legacy systems may require too much effort
to implement such offload processes, so legacy data is often copied into the data warehouse, leaving
the original data in place. Extracted data is loaded into the data warehouse staging area (a relational
database usually separate from the data warehouse database), for manipulation by the remaining ETL
processes.
Data extraction is generally performed within the source system itself, especially if it is a relational
database to which extraction procedures can easily be added. It is also possible for the extraction logic
to exist in the data warehouse staging area and query the source system for data using ODBC, OLE DB,
or other APIs. For legacy systems, the most common method of data extraction is for the legacy system
to produce text files, although many newer systems offer direct query APIs or accommodate access
through ODBC or OLE DB.
Data extraction processes can be implemented using Transact-SQL stored procedures, Data
Transformation Services (DTS) tasks, or custom applications developed in programming or scripting
languages.
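A minimal sketch of an extraction step, using SQLite in place of an OLTP source system and a staging database (table and column names are hypothetical):

```python
import sqlite3

# Simulated OLTP source system.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.99), (2, 24.50)])

# Staging area: a relational database usually separate from the warehouse.
staging = sqlite3.connect(":memory:")
staging.execute("CREATE TABLE stg_orders (order_id INTEGER, amount REAL)")

# Extract: copy rows into the staging area, leaving the originals in place,
# where the remaining ETL elements can manipulate them.
rows = source.execute("SELECT order_id, amount FROM orders").fetchall()
staging.executemany("INSERT INTO stg_orders VALUES (?, ?)", rows)
print(staging.execute("SELECT COUNT(*) FROM stg_orders").fetchone()[0])  # → 2
```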
Transformation
The ETL transformation element is responsible for data validation, data accuracy, data type conversion,
and business rule application. It is the most complicated of the ETL elements. It may appear to be more
efficient to perform some transformations as the data is being extracted (inline transformation);
however, an ETL system that uses inline transformations during extraction is less robust and flexible
than one that confines transformations to the transformation element. Transformations performed in
the OLTP system impose a performance burden on the OLTP database. They also split the
transformation logic between two ETL elements and add maintenance complexity when the ETL logic
changes.
Tools used in the transformation element vary. Some data validation and data accuracy checking can be
accomplished with straightforward Transact-SQL code. More complicated transformations can be
implemented using DTS packages. The application of complex business rules often requires the
development of sophisticated custom applications in various programming languages. You can use DTS
packages to encapsulate multi-step transformations into a single task.
Listed below are some basic examples that illustrate the types of transformations performed by this
element:
Data Validation
Check that all rows in the fact table match rows in dimension tables to enforce data integrity.
Data Accuracy
Ensure that fields contain appropriate values, such as only "off" or "on" in a status field.
Data Type Conversion
Ensure that all values for a specified field are stored the same way in the data warehouse regardless of
how they were stored in the source system. For example, if one source system stores "off" or "on" in its
status field and another source system stores "0" or "1" in its status field, then a data type conversion
transformation converts the content of one or both of the fields to a specified common value such as
"off" or "on".
Business Rule Application
Ensure that the rules of the business are enforced on the data stored in the warehouse. For example,
check that all customer records contain values for both FirstName and LastName fields.
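The data type conversion and business rule examples above can be sketched as a small transformation function (field names and decode values are hypothetical):

```python
# Normalize status values from two source systems ("off"/"on" vs. "0"/"1")
# to a single common representation.
STATUS_DECODE = {"0": "off", "1": "on", "off": "off", "on": "on"}

def transform(record):
    # Data type conversion: store the status field one way, regardless of
    # how the source system stored it.
    record["status"] = STATUS_DECODE[record["status"]]
    # Business rule application: customer records must carry both names.
    if not record.get("FirstName") or not record.get("LastName"):
        raise ValueError("customer record missing FirstName or LastName")
    return record

rec = transform({"FirstName": "Ada", "LastName": "Lovelace", "status": "1"})
print(rec["status"])  # → on
```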
Loading
The ETL loading element is responsible for loading transformed data into the data warehouse database.
Data warehouses are usually updated periodically rather than continuously, and large numbers of
records are often loaded to multiple tables in a single data load. The data warehouse is often taken
offline during update operations so that data can be loaded faster and SQL Server 2000 Analysis
Services can update OLAP cubes to incorporate the new data. BULK INSERT, bcp, and the Bulk Copy
API are the best tools for data loading operations. The design of the loading element should focus on
efficiency and performance to minimize the data warehouse offline time. For more information and
details about performance tuning, see Chapter 20, "RDBMS Performance Tuning Guide for Data
Warehousing."
Meta Data
The ETL meta data functional element is responsible for maintaining information (meta data) about the
movement and transformation of data, and the operation of the data warehouse. It also documents the
data mappings used during the transformations. Meta data logging provides possibilities for automated
administration, trend prediction, and code reuse.
Examples of data warehouse meta data that can be recorded and used to analyze the activity and
performance of a data warehouse include:
• Data Lineage, such as the time that a particular set of records was loaded
into the data warehouse.
• Data Type Usage, such as identifying all tables that use the "Birthdate" user-
defined data type.
• DTS Package Versioning, which can be used to view, branch, or retrieve any
historical version of a particular DTS package.
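Data lineage, the first example above, can be sketched as a load log written during each ETL run (the schema and names here are hypothetical):

```python
import sqlite3
from datetime import datetime, timezone

# Meta data store: record when each batch of records was loaded.
meta = sqlite3.connect(":memory:")
meta.execute("""
    CREATE TABLE load_log (
        batch_id INTEGER PRIMARY KEY,
        table_name TEXT,
        rows_loaded INTEGER,
        loaded_at TEXT
    )
""")

def log_load(table_name, rows_loaded):
    meta.execute(
        "INSERT INTO load_log (table_name, rows_loaded, loaded_at) VALUES (?, ?, ?)",
        (table_name, rows_loaded, datetime.now(timezone.utc).isoformat()),
    )

log_load("sales_fact", 120000)
print(meta.execute("SELECT table_name, rows_loaded FROM load_log").fetchall())
```

Such a log supports the automated administration and trend prediction mentioned above, since load times and volumes can be analyzed across runs.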
Regardless of their implementation, a number of design considerations are common to all ETL systems:
Modularity
ETL systems should contain modular elements that perform discrete tasks. This encourages reuse and
makes them easy to modify when implementing changes in response to business and data warehouse
changes. Monolithic systems should be avoided.
Consistency
ETL systems should guarantee consistency of data when it is loaded into the data warehouse. An entire
data load should be treated as a single logical transaction—either the entire data load is successful or
the entire load is rolled back. In some systems, the load is a single physical transaction, whereas in
others it is a series of transactions. Regardless of the physical implementation, the data load should be
treated as a single logical transaction.
Flexibility
ETL systems should be developed to meet the needs of the data warehouse and to accommodate the
source data environments. It may be appropriate to accomplish some transformations in text files and
some on the source data system; others may require the development of custom applications. A variety
of technologies and techniques can be applied, using the tool most appropriate to the individual task of
each ETL functional element.
Speed
ETL systems should be as fast as possible. Ultimately, the time window available for ETL processing is
governed by data warehouse and source system schedules. Some data warehouse elements may have
a huge processing window (days), while others may have a very limited processing window (hours).
Regardless of the time available, it is important that the ETL system execute as rapidly as possible.
Heterogeneity
ETL systems should be able to work with a wide variety of data in different formats. An ETL system that
only works with a single type of source data is useless.
ETL systems are arguably the single most important source of meta data about both the data in the
data warehouse and data in the source system. The ETL process itself also generates useful meta data
that should be retained and analyzed regularly. Meta data is discussed in greater detail later in this
chapter.
ETL Architectures
Before discussing the physical implementation of ETL systems, it is important to understand the
different ETL architectures and how they relate to each other. Essentially, ETL systems can be classified
in two architectures: the homogenous architecture and the heterogeneous architecture.
Homogenous Architecture
A homogenous architecture for an ETL system is one that involves only a single source system and a
single target system. Data flows from the single source of data through the ETL processes and is loaded
into the data warehouse, as shown in the following diagram.
• Single data source: Data is extracted from a single source system, such as an
OLTP system.
• Light structural transformation: Because the data comes from a single source,
the amount of structural changes such as table alteration is also very light. The
structural changes typically involve denormalization efforts to meet data
warehouse schema requirements.
The homogeneous ETL architecture is generally applicable to data marts, especially those focused on a
single subject matter.
Heterogeneous Architecture
A heterogeneous architecture for an ETL system is one that extracts data from multiple sources, as
shown in the following diagram. The complexity of this architecture arises from the fact that data from
more than one source must be merged, rather than from the fact that data may be formatted
differently in the different sources. However, significantly different storage formats and database
schemas do provide additional complications.
Heterogeneous ETL architectures are found more often in data warehouses than in data marts.
ETL Development
ETL development consists of two general phases: identifying and mapping data, and developing
functional element implementations. Both phases should be carefully documented and stored in a
central, easily accessible location, preferably in electronic form.
Identifying and Mapping Data
This phase of the development process identifies sources of data elements, the targets for those data
elements in the data warehouse, and the transformations that must be applied to each data element as
it is migrated from its source to its destination. High level data maps should be developed during the
requirements gathering and data modeling phases of the data warehouse project. During the ETL
system design and development process, these high level data maps are extended to thoroughly specify
system details.
For some systems, identifying the source data may be as simple as identifying the server where the
data is stored in an OLTP database and the storage type (SQL Server database, Microsoft Excel
spreadsheet, or text file, among others). In other systems, identifying the source may mean preparing
a detailed definition of the meaning of the data, such as a business rule, a definition of the data itself,
such as decoding rules (O = On, for example), or even detailed documentation of a source system for
which the system documentation has been lost or is not current.
Each data element is destined for a target in the data warehouse. A target for a data element may be
an attribute in a dimension table, a numeric measure in a fact table, or a summarized total in an
aggregation table. There may not be a one-to-one correspondence between a source data element and
a data element in the data warehouse because the destination system may not contain the data at the
same granularity as the source system. For example, a retail client may decide to roll data up to the
SKU level by day rather than track individual line item data. The level of item detail that is stored in the
fact table of the data warehouse is called the grain of the data. If the grain of the target does not match
the grain of the source, the data must be summarized as it moves from the source to the target.
A data map defines the source fields of the data, the destination fields in the data warehouse and any
data modifications that need to be accomplished to transform the data into the desired format for the
data warehouse. Some transformations require aggregating the source data to a coarser granularity,
such as summarizing individual item sales into daily sales by SKU. Other transformations involve
altering the source data itself as it moves from the source to the target. Some transformations decode
data into human readable form, such as replacing "1" with "on" and "0" with "off" in a status field. If
two source systems encode data destined for the same target differently (for example, a second source
system uses Yes and No for status), a separate transformation for each source system must be defined.
Transformations must be documented and maintained in the data maps. The relationship between the
source and target systems is maintained in a map that is referenced to execute the transformation of
the data before it is loaded in the data warehouse.
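A data map can be sketched as a simple structure that ties a source field to its target and transformation, including the per-source-system rules described above (system, field, and value names are hypothetical):

```python
# One data map entry per (source system, source field) pair; the two entries
# below carry different transformations because the two source systems
# encode the same target field differently.
DATA_MAP = [
    {"source": ("crm", "cust_status"), "target": ("customer_dim", "status"),
     "transform": lambda v: {"0": "off", "1": "on"}[v]},
    {"source": ("erp", "cust_status"), "target": ("customer_dim", "status"),
     "transform": lambda v: {"No": "off", "Yes": "on"}[v]},
]

def apply_map(source_system, field, value):
    for entry in DATA_MAP:
        if entry["source"] == (source_system, field):
            return entry["target"], entry["transform"](value)
    raise KeyError((source_system, field))

print(apply_map("erp", "cust_status", "Yes"))  # → (('customer_dim', 'status'), 'on')
```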
Developing Functional Element Implementations
Design and implementation of the four ETL functional elements, Extraction, Transformation, Loading,
and meta data logging, vary from system to system. There will often be multiple versions of each
functional element.
Each functional element contains steps that perform individual tasks, which may execute on one of
several systems, such as the OLTP or legacy systems that contain the source data, the staging area
database, or the data warehouse database. Various tools and techniques may be used to implement the
steps in a single functional area, such as Transact-SQL, DTS packages, or custom applications
developed in a programming language such as Microsoft Visual Basic®. Steps that are discrete in one
functional element may be combined in another.
Extraction
The extraction element may have one version to extract data from one OLTP data source, a different
version for a different OLTP data source, and multiple versions for legacy systems and other sources of
data. This element may include tasks that execute SELECT queries from the ETL staging database
against a source OLTP system, or it may execute some tasks on the source system directly and others
in the staging database, as in the case of generating a flat file from a legacy system and then importing
it into tables in the ETL database. Regardless of methods or number of steps, the extraction element is
responsible for extracting the required data from the source system and making it available for
processing by the next element.
Transformation
Frequently a number of different transformations, implemented with various tools or techniques, are
required to prepare data for loading into the data warehouse. Some transformations may be performed
as data is extracted, such as an application on a legacy system that collects data from various internal
files as it produces a text file of data to be further transformed. However, transformations are best
accomplished in the ETL staging database, where data from several data sources may require varying
transformations specific to the incoming data organization and format.
Data from a single data source usually requires different transformations for different portions of the
incoming data. Fact table data transformations may include summarization, and will always require
surrogate dimension keys to be added to the fact records. Data destined for dimension tables in the
data warehouse may require one process to accomplish one type of update to a changing dimension
and a different process for another type of update.
Regardless of the number and variety of transformations and their implementations, the transformation
element is responsible for preparing data for loading into the data warehouse.
Loading
The loading element typically has the least variety of task implementations. After the data from the
various data sources has been extracted, transformed, and combined, the loading operation consists of
inserting records into the various data warehouse database dimension and fact tables. Implementation
may vary in the loading tasks, such as using BULK INSERT, bcp, or the Bulk Copy API. The loading
element is responsible for loading data into the data warehouse database tables.
Meta Data Logging
Meta data is collected from a number of the ETL operations. The meta data logging implementation for
a particular ETL task will depend on how the task is implemented. For a task implemented by using a
custom application, the application code may produce the meta data. For tasks implemented by using
Transact-SQL, meta data can be captured with Transact-SQL statements in the task processes. The
meta data logging element is responsible for capturing and recording meta data that documents the
operation of the ETL functional areas and tasks, which includes identification of data that moves
through the ETL system as well as the efficiency of ETL tasks.
Common Tasks
Each ETL functional element should contain tasks that perform the following functions, in addition to
tasks specific to the functional area itself:
Confirm Success or Failure. A confirmation should be generated on the success or failure of the
execution of the ETL processes. Ideally, this mechanism should exist for each task so that rollback
mechanisms can be implemented to allow for incremental responses to errors.
Scheduling. ETL tasks should include the ability to be scheduled for execution. Scheduling mechanisms
reduce repetitive manual operations and allow for maximum use of system resources during recurring
periods of low activity.
Data Mining
Data Mining is an analytic process designed to explore data (usually large amounts of data - typically
business or market related) in search of consistent patterns and/or systematic relationships between
variables, and then to validate the findings by applying the detected patterns to new subsets of data.
The ultimate goal of data mining is prediction - and predictive data mining is the most common
type of data mining and one that has the most direct business applications. The process of data mining
consists of three stages: (1) the initial exploration, (2) model building or pattern identification with
validation/verification, and (3) deployment (i.e., the application of the model to new data in
order to generate predictions).
Stage 1: Exploration. Depending on the nature of the analytic problem, this first stage of the process
of data mining may involve anywhere from a simple choice of straightforward predictors for a
regression model, to elaborate exploratory analyses using a wide variety of graphical and statistical
methods (see Exploratory Data Analysis (EDA)) in order to identify the most relevant variables
and determine the complexity and/or the general nature of models that can be taken into account in
the next stage.
Stage 2: Model building and validation. This stage involves considering various models and
choosing the best one based on their predictive performance (i.e., explaining the variability in question
and producing stable results across samples). This may sound like a simple operation, but in fact, it
sometimes involves a very elaborate process. There are a variety of techniques developed to achieve
that goal - many of which are based on so-called "competitive evaluation of models," that is, applying
different models to the same data set and then comparing their performance to choose the best. These
techniques - which are often considered the core of predictive data mining - include: Bagging
(Voting, Averaging), Boosting, Stacking (Stacked Generalizations), and Meta-Learning.
Stage 3: Deployment. That final stage involves using the model selected as best in the previous stage
and applying it to new data in order to generate predictions or estimates of the expected outcome.
The concept of Data Mining is becoming increasingly popular as a business information management
tool where it is expected to reveal knowledge structures that can guide decisions in conditions of limited
certainty. Recently, there has been increased interest in developing new analytic techniques specifically
designed to address the issues relevant to business Data Mining (e.g., Classification Trees), but
Data Mining is still based on the conceptual principles of statistics, including the traditional Exploratory
Data Analysis (EDA) and modeling, and it shares with them both some components of its general
approaches and specific techniques.
However, an important general difference in the focus and purpose between Data Mining and the
traditional Exploratory Data Analysis (EDA) is that Data Mining is more oriented towards
applications than the basic nature of the underlying phenomena. In other words, Data Mining is
relatively less concerned with identifying the specific relations between the involved variables. For
example, uncovering the nature of the underlying functions or the specific types of interactive,
multivariate dependencies between variables are not the main goal of Data Mining. Instead, the focus is
on producing a solution that can generate useful predictions. Therefore, Data Mining accepts among
others a "black box" approach to data exploration or knowledge discovery and uses not only the
traditional Exploratory Data Analysis (EDA) techniques, but also such techniques as Neural
Networks which can generate valid predictions but are not capable of identifying the specific nature of
the interrelations between the variables on which the predictions are based.
Data Mining is often considered to be "a blend of statistics, AI [artificial intelligence], and data base
research" (Pregibon, 1997, p. 8), which until very recently was not commonly recognized as a field of
interest for statisticians, and was even considered by some "a dirty word in Statistics" (Pregibon, 1997,
p. 8). Due to its applied importance, however, the field emerges as a rapidly growing and major area
(also in statistics) where important theoretical advances are being made (see, for example, the recent
annual International Conferences on Knowledge Discovery and Data Mining, co-hosted by the American
Statistical Association).
For information on Data Mining techniques, please review the summary topics included below in this
chapter of the Electronic Statistics Textbook. There are numerous books that review the theory and
practice of data mining; the following books offer a representative sample of recent general books on
the subject:
Berry, M. J. A., & Linoff, G. S. (2000). Mastering data mining. New York: Wiley.
Edelstein, H. A. (1999). Introduction to data mining and knowledge discovery (3rd ed.). Potomac, MD:
Two Crows Corp.
Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1996). Advances in knowledge
discovery & data mining. Cambridge, MA: MIT Press.
Han, J., & Kamber, M. (2000). Data mining: Concepts and techniques. New York: Morgan Kaufmann.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The elements of statistical learning: Data mining,
inference, and prediction. New York: Springer.
Weiss, S. M., & Indurkhya, N. (1997). Predictive data mining: A practical guide. New York: Morgan
Kaufmann.
Westphal, C., & Blaxton, T. (1998). Data mining solutions. New York: Wiley.
Witten, I. H., & Frank, E. (2000). Data mining. New York: Morgan Kaufmann.
Boosting
The concept of boosting applies to the area of predictive data mining, to generate multiple models
or classifiers (for prediction or classification), and to derive weights to combine the predictions from
those models into a single prediction or predicted classification (see also Bagging).
A simple algorithm for boosting works like this: Start by applying some method (e.g., a tree classifier
such as C&RT or CHAID) to the learning data, where each observation is assigned an equal weight.
Compute the predicted classifications, and apply weights to the observations in the learning sample that
are inversely proportional to the accuracy of the classification. In other words, assign greater weight to
those observations that were difficult to classify (where the misclassification rate was high), and lower
weights to those that were easy to classify (where the misclassification rate was low). In the context of
C&RT for example, different misclassification costs (for the different classes) can be applied, inversely
proportional to the accuracy of prediction in each class. Then apply the classifier again to the weighted
data (or with different misclassification costs), and continue with the next iteration (application of the
analysis method for classification to the re-weighted data).
Boosting will generate a sequence of classifiers, where each consecutive classifier in the sequence is an
"expert" in classifying observations that were not well classified by those preceding it. During
deployment (for prediction or classification of new cases), the predictions from the different classifiers
can then be combined (e.g., via voting, or some weighted voting procedure) to derive a single best
prediction or classification.
Note that boosting can also be applied to learning methods that do not explicitly support weights or
misclassification costs. In that case, random sub-sampling can be applied to the learning data in the
successive steps of the iterative boosting procedure, where the probability for selection of an
observation into the subsample is inversely proportional to the accuracy of the prediction for that
observation in the previous iteration (in the sequence of iterations of the boosting procedure).
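The re-weighting procedure described above can be made concrete. The following Python example implements a minimal AdaBoost-style boosting loop over one-dimensional threshold ("stump") classifiers; the data and the number of rounds are invented for illustration, and real implementations add many refinements.

```python
import math

def stump_predict(x, thresh, polarity):
    # polarity +1: predict +1 when x > thresh; polarity -1: the reverse
    return polarity if x > thresh else -polarity

def fit_stump(xs, ys, weights):
    """Pick the threshold/polarity pair with the lowest weighted error."""
    values = sorted(set(xs))
    thresholds = [values[0] - 1.0] + [
        (a + b) / 2 for a, b in zip(values, values[1:])
    ]
    best = None
    for thresh in thresholds:
        for polarity in (+1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, thresh, polarity) != y)
            if best is None or err < best[0]:
                best = (err, thresh, polarity)
    return best

def boost(xs, ys, rounds=3):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []  # list of (alpha, thresh, polarity)
    for _ in range(rounds):
        err, thresh, polarity = fit_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard the log below
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thresh, polarity))
        # Re-weight: misclassified observations gain weight, so the next
        # stump concentrates on the cases that were hard to classify.
        weights = [w * math.exp(-alpha * y * stump_predict(x, thresh, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    # Weighted vote of all stumps in the sequence.
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score > 0 else -1

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1, 1, -1, 1]          # no single threshold separates these labels
model = boost(xs, ys, rounds=3)
preds = [predict(model, x) for x in xs]
```

On this toy data no single stump classifies every point, but after three boosting rounds the weighted ensemble does.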
CRISP
See Models for Data Mining.
Deployment
The concept of deployment in predictive data mining refers to the application of a model for
prediction or classification to new data. After a satisfactory model or set of models has been identified
(trained) for a particular application, one usually wants to deploy those models so that predictions or
predicted classifications can quickly be obtained for new data. For example, a credit card company may
want to deploy a trained model or set of models (e.g., neural networks, meta-learner) to quickly
identify transactions which have a high probability of being fraudulent.
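As a sketch of the idea, deployment can be as simple as serializing a fitted model and loading it wherever new cases arrive for scoring. The "model," its threshold, and the field names below are all invented for illustration; a real fraud model would be far more elaborate.

```python
import pickle

# A trivial stand-in for a trained model: flag transactions above a
# learned amount threshold. Threshold and field names are made up.
class FraudModel:
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, transaction):
        return "review" if transaction["amount"] > self.threshold else "ok"

# Training happens once; the fitted model is then serialized...
model = FraudModel(threshold=5000.0)
blob = pickle.dumps(model)

# ...and deployed elsewhere, where it scores new transactions quickly.
deployed = pickle.loads(blob)
labels = [deployed.predict(t) for t in (
    {"amount": 120.0}, {"amount": 9800.0})]
```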
Drill-Down Analysis
The concept of drill-down analysis applies to the area of data mining, to denote the interactive
exploration of data, in particular of large databases. The process of drill-down analyses begins by
considering some simple break-downs of the data by a few variables of interest (e.g., Gender,
geographic region, etc.). Various statistics, tables, histograms, and other graphical summaries can be
computed for each group. Next one may want to "drill-down" to expose and further analyze the data
"underneath" one of the categorizations, for example, one might want to further review the data for
males from the mid-west. Again, various statistical and graphical summaries can be computed for those
cases only, which might suggest further break-downs by other variables (e.g., income, age, etc.). At
the lowest ("bottom") level are the raw data: For example, you may want to review the addresses of
male customers from one region, for a certain income group, etc., and to offer to those customers
some particular services of particular utility to that group.
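A minimal drill-down of this kind can be sketched with plain Python; the customer records and category names below are fabricated for illustration.

```python
from collections import Counter

customers = [
    {"gender": "M", "region": "midwest", "income": "high"},
    {"gender": "M", "region": "midwest", "income": "low"},
    {"gender": "F", "region": "midwest", "income": "high"},
    {"gender": "M", "region": "east",    "income": "low"},
]

# Top level: a simple break-down of the data by one variable of interest.
by_gender = Counter(c["gender"] for c in customers)

# Drill down: restrict attention to one category (males from the midwest)
# and break down that subset by a further variable.
midwest_males = [c for c in customers
                 if c["gender"] == "M" and c["region"] == "midwest"]
by_income = Counter(c["income"] for c in midwest_males)
```

Each level of the drill-down simply filters the cases and re-summarizes, which is why the same statistics and graphs can be computed at every level down to the raw records.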
Feature Selection
One of the preliminary stages in predictive data mining, when the data set includes more variables
than could be included (or would be efficient to include) in the actual model building phase (or even in
initial exploratory operations), is to select predictors from a large list of candidates. For example, when
data are collected via automated (computerized) methods, it is not uncommon that measurements are
recorded for thousands or hundreds of thousands (or more) of predictors. The standard analytic
methods for predictive data mining, such as neural network analyses, classification and
regression trees, generalized linear models, or general linear models become impractical
when the number of predictors exceeds a few hundred variables.
Feature selection selects a subset of predictors from a large list of candidate predictors without
assuming that the relationships between the predictors and the dependent or outcome variables of
interest are linear, or even monotone. Therefore, this is used as a pre-processor for predictive data
mining, to select manageable sets of predictors that are likely related to the dependent (outcome)
variables of interest, for further analyses with any of the other methods for regression and
classification.
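One simple, model-free screen of this kind ranks candidate predictors by their mutual information with the outcome, a measure that captures non-linear and non-monotone relationships. The toy data below is fabricated, and real feature selection tools use more refined estimators; this is only a sketch of the principle.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X;Y) in nats for two discrete sequences of equal length."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Toy data: feature "a" determines the outcome; "b" is pure noise.
features = {
    "a": [0, 0, 1, 1, 0, 0, 1, 1],
    "b": [0, 1, 0, 1, 0, 1, 0, 1],
}
outcome = [0, 0, 1, 1, 0, 0, 1, 1]

# Rank candidate predictors by mutual information with the outcome.
ranked = sorted(features,
                key=lambda f: mutual_information(features[f], outcome),
                reverse=True)
```

The informative predictor ranks first and the noise predictor scores essentially zero, so only the top of the ranked list would be passed on to the model building stage.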
Machine Learning
Machine learning, computational learning theory, and similar terms are often used in the context of
Data Mining, to denote the application of generic model-fitting or classification algorithms for
predictive data mining. Unlike traditional statistical data analysis, which is usually concerned with
the estimation of population parameters by statistical inference, the emphasis in data mining (and
machine learning) is usually on the accuracy of prediction (predicted classification), regardless of
whether or not the "models" or techniques used to generate the predictions are interpretable or
open to simple explanation. Good examples of this type of technique often applied to predictive data
mining are neural networks or meta-learning techniques such as boosting, etc. These methods
usually involve the fitting of very complex "generic" models, that are not related to any reasoning or
theoretical understanding of underlying causal processes; instead, these techniques can be shown to
generate accurate predictions or classifications in cross validation samples.
Meta-Learning
The concept of meta-learning applies to the area of predictive data mining, to combine the
predictions from multiple models. It is particularly useful when the types of models included in the
project are very different. In this context, this procedure is also referred to as Stacking (Stacked
Generalization).
Suppose your data mining project includes tree classifiers, such as C&RT and CHAID, linear
discriminant analysis (e.g., see GDA), and Neural Networks. Each computes predicted classifications
for a cross validation sample, from which overall goodness-of-fit statistics (e.g., misclassification
rates) can be computed. Experience has shown that combining the predictions from multiple methods
often yields more accurate predictions than can be derived from any one method (e.g., see Witten and
Frank, 2000). The predictions from different classifiers can be used as input into a meta-learner, which
will attempt to combine the predictions to create a final best predicted classification. So, for example,
the predicted classifications from the tree classifiers, linear model, and the neural network classifier(s)
can be used as input variables into a neural network meta-classifier, which will attempt to "learn" from
the data how to combine the predictions from the different models to yield maximum classification
accuracy.
One can apply meta-learners to the results from different meta-learners to create "meta-meta"-
learners, and so on; however, in practice such an exponential increase in the amount of data
processing required to derive a prediction will yield less and less marginal utility.
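As a sketch of the idea (using accuracy-weighted voting as a simple stand-in for a trained meta-classifier such as the neural network meta-learner described above), suppose three base classifiers have produced predicted classifications for a cross validation sample; all numbers below are invented.

```python
# Hypothetical cross-validation predictions from three base classifiers,
# plus the true classes. In a real project these might come from C&RT,
# a linear model, and a neural network.
truth  = [1, 0, 1, 1, 0, 1]
tree   = [1, 0, 0, 1, 0, 1]   # 5/6 correct
linear = [1, 1, 1, 1, 0, 0]   # 4/6 correct
nnet   = [0, 0, 1, 1, 0, 1]   # 5/6 correct

def accuracy(preds, truth):
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

# A minimal meta-learner: weight each base model's vote by its
# cross-validation accuracy, then take the weighted majority.
models = {"tree": tree, "linear": linear, "nnet": nnet}
weights = {name: accuracy(preds, truth) for name, preds in models.items()}

def meta_predict(votes):
    """votes: mapping from model name to that model's predicted class."""
    score = sum(weights[m] * (1 if v == 1 else -1) for m, v in votes.items())
    return 1 if score > 0 else 0

stacked = [meta_predict({m: models[m][i] for m in models})
           for i in range(len(truth))]
```

Here the weighted-vote combiner classifies every case correctly even though each base model errs somewhere; a true stacked meta-learner would be trained on these predictions rather than hand-weighted.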
Models for Data Mining
One such model, CRISP (Cross-Industry Standard Process for data mining) was proposed in the mid-
1990s by a European consortium of companies to serve as a non-proprietary standard process model
for data mining. This general approach postulates the following (perhaps not particularly controversial)
general sequence of steps for data mining projects: business understanding, data understanding, data
preparation, modeling, evaluation, and deployment.
Another approach - the Six Sigma methodology - is a well-structured, data-driven methodology for
eliminating defects, waste, or quality control problems of all kinds in manufacturing, service delivery,
management, and other business activities. This model has recently become very popular (due to its
successful implementations) in various American industries, and it appears to gain favor worldwide. It
postulates a sequence of so-called DMAIC steps (Define, Measure, Analyze, Improve, Control) that
grew out of the manufacturing, quality improvement, and process control traditions and is
particularly well suited to production environments (including "production of services," i.e., service
industries).
Another framework of this kind (actually somewhat similar to Six Sigma) is the approach proposed by
SAS Institute called SEMMA (Sample, Explore, Modify, Model, Assess), which focuses more on the
technical activities typically involved in a data mining project.
All of these models are concerned with the process of how to integrate data mining methodology into an
organization, how to "convert data into information," how to involve important stake-holders, and how
to disseminate the information in a form that can easily be converted by stake-holders into resources
for strategic decision making.
Some software tools for data mining are specifically designed and documented to fit into one of these
specific frameworks.
The general underlying philosophy of StatSoft's STATISTICA Data Miner is to provide a flexible data
mining workbench that can be integrated into any organization, industry, or organizational culture,
regardless of the general data mining process-model that the organization chooses to adopt. For
example, STATISTICA Data Miner can include the complete set of (specific) necessary tools for ongoing
company wide Six Sigma quality control efforts, and users can take advantage of its (still optional)
DMAIC-centric user interface for industrial data mining tools. It can equally well be integrated into
ongoing marketing research, CRM (Customer Relationship Management) projects, etc. that follow either
the CRISP or SEMMA approach - it fits both of them perfectly well without favoring either one. Also,
STATISTICA Data Miner offers all the advantages of a general data mining oriented "development kit"
that includes easy to use tools for incorporating into your projects not only such components as custom
database gateway solutions, prompted interactive queries, or proprietary algorithms, but also systems
of access privileges, workgroup management, and other collaborative work tools that allow you to
design large scale, enterprise-wide systems (e.g., following the CRISP, SEMMA, or a combination of
both models) that involve your entire organization.
SEMMA
See Models for Data Mining.
Stacked Generalization
See Stacking.
Stacking
Suppose your data mining project includes tree classifiers, such as C&RT or CHAID, linear
discriminant analysis (e.g., see GDA), and Neural Networks. Each computes predicted classifications
for a cross validation sample, from which overall goodness-of-fit statistics (e.g., misclassification
rates) can be computed. Experience has shown that combining the predictions from multiple methods
often yields more accurate predictions than can be derived from any one method (e.g., see Witten and
Frank, 2000). In stacking, the predictions from different classifiers are used as input into a meta-
learner, which attempts to combine the predictions to create a final best predicted classification. So,
for example, the predicted classifications from the tree classifiers, linear model, and the neural network
classifier(s) can be used as input variables into a neural network meta-classifier, which will attempt to
"learn" from the data how to combine the predictions from the different models to yield maximum
classification accuracy.
Other methods for combining the prediction from multiple models or methods (e.g., from multiple
datasets used for learning) are Boosting and Bagging (Voting).
Text Mining
While Data Mining is typically concerned with the detection of patterns in numeric data, very often
important (e.g., critical to business) information is stored in the form of text. Unlike numeric data, text
is often amorphous, and difficult to deal with. Text mining generally consists of the analysis of
(multiple) text documents by extracting key phrases, concepts, etc. and the preparation of the text
processed in that manner for further analyses with numeric data mining techniques (e.g., to determine
co-occurrences of concepts, key phrases, names, addresses, product names, etc.).
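A first step of this kind, turning free text into counts that numeric data mining techniques can use, might look like the following sketch; the documents and the stop-word list are invented for illustration.

```python
import re
from collections import Counter
from itertools import combinations

documents = [
    "Shipment delayed due to customs inspection",
    "Customer reports shipment damaged on arrival",
    "Customs paperwork missing for shipment",
]
stopwords = {"due", "to", "on", "for"}

def terms(doc):
    """Lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", doc.lower())
            if w not in stopwords]

# Turn free text into numeric data: overall term frequencies...
term_counts = Counter(t for doc in documents for t in terms(doc))

# ...and co-occurrence counts of term pairs within the same document,
# which can feed further numeric analyses.
pair_counts = Counter()
for doc in documents:
    for pair in combinations(sorted(set(terms(doc))), 2):
        pair_counts[pair] += 1
```

The resulting counts (e.g., how often "customs" and "shipment" occur together) are ordinary numeric data, so the techniques described elsewhere in this chapter apply to them directly.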
Most organizations have multiple formats and locations in which data is stored. To support decision-
making, improve system performance, or upgrade existing systems, data often must be moved from
one data storage location to another.
Microsoft® SQL Server™ 2000 Data Transformation Services (DTS) provides a set of tools that lets you
extract, transform, and consolidate data from disparate sources into single or multiple destinations. By
using DTS tools, you can create custom data movement solutions tailored to the specialized needs of
your organization, as shown in the following scenarios:
• To copy and transform your data, you can build a DTS solution that copies database
objects from the original data source into a SQL Server 2000 database, while at the
same time remapping columns and changing data types. You can run this solution
using DTS tools, or you can embed the solution within your application.
• You must consolidate several key Microsoft Excel spreadsheets into a SQL Server
database. Several departments create the spreadsheets at the end of the month,
but there is no set schedule for completion of all the spreadsheets.
• To consolidate the spreadsheet data, you can build a DTS solution that runs when a
message is sent to a message queue. The message triggers DTS to extract data
from the spreadsheet, perform any defined transformations, and load the data into
a SQL Server database.
• Your data warehouse contains historical data about your business operations, and
you use Microsoft SQL Server 2000 Analysis Services to summarize the data. Your
data warehouse needs to be updated nightly from your Online Transaction
Processing (OLTP) database. Your OLTP system is in use 24 hours a day, and
performance is critical.
You can build a DTS solution that uses the file transfer protocol (FTP) to move data
files onto a local drive, loads the data into a fact table, and aggregates the data
using Analysis Services. You can schedule the DTS solution to run every night, and
you can use the new DTS logging options to track how long this process takes,
allowing you to analyze performance over time.
What Is DTS?
DTS is a set of tools you can use to import, export, and transform heterogeneous data between one or
more data sources, such as Microsoft SQL Server, Microsoft Excel, or Microsoft Access. Connectivity is
provided through OLE DB, an open standard for data access. ODBC (Open Database Connectivity) data
sources are supported through the OLE DB Provider for ODBC.
You create a DTS solution as one or more packages. Each package may contain an organized set of
tasks that define work to be performed, transformations on data and objects, workflow constraints that
define task execution, and connections to data sources and destinations. DTS packages also provide
services, such as logging package execution details, controlling transactions, and handling global
variables.
These tools are available for creating and executing DTS packages:
• The Import/Export Wizard is for building relatively simple DTS packages, and
supports data migration and simple transformations.
• The DTS Designer graphically implements the DTS object model, allowing you to
create DTS packages with a wide range of functionality.
Using the DTS object model, you also can create and run packages programmatically, build custom
tasks, and build custom transformations.
Microsoft SQL Server 2000 introduces several DTS enhancements and new features:
• New DTS tasks include the FTP task, the Execute Package task, the Dynamic
Properties task, and the Message Queue task.
• Enhanced logging saves information for each package execution, allowing you to
maintain a complete execution history and view information for each process within
a task. You can generate exception files, which contain rows of data that could not
be processed due to errors.
• A new multiphase data pump allows advanced users to customize the operation of
data transformations at various stages. Also, you can use global variables as input
parameters for queries.
• You can use parameterized source queries in DTS transformation tasks and the
Execute SQL task.
• You can use the Execute Package task to dynamically assign the values of global
variables from a parent package to a child package.
DTS Designer graphically implements the DTS object model, allowing you to graphically create DTS
packages. You can use DTS Designer to:
• Create a package with complex workflows that include multiple steps using
conditional logic, event-driven code, or multiple connections to data sources.
The DTS Designer interface consists of a work area for building packages, toolbars containing package
elements that you can drag onto the design sheet, and menus containing workflow and package
management commands.
By dragging connections and tasks onto the design sheet, and specifying the order of execution with
workflows, you can easily build powerful DTS packages using DTS Designer. The following sections
define tasks, workflows, connections, and transformations, and illustrate the ease of using DTS
Designer to implement a DTS solution.
A DTS package usually includes one or more tasks. Each task defines a work item that may be
performed during package execution. You can use tasks to:
• Transform data
You also can create custom tasks programmatically, and then integrate them into DTS Designer using
the Register Custom Task command.
To illustrate the use of tasks, here is a simple DTS Package with two tasks: a Microsoft ActiveX® Script
task and a Send Mail task:
The ActiveX Script task can host any ActiveX Scripting engine including Microsoft Visual Basic Scripting
Edition (VBScript), Microsoft JScript®, or ActiveState ActivePerl, which you can download from
http://www.activestate.com. The Send Mail task may send a message indicating that the package
has run. Note that there is no order to these tasks yet. When the package executes, the ActiveX Script
task and the Send Mail task run concurrently.
When you define a group of tasks, there is usually an order in which the tasks should be performed.
When tasks have an order, each task becomes a step of a process. In DTS Designer, you manipulate
tasks on the DTS Designer design sheet and use precedence constraints to control the sequence in
which the tasks execute.
Precedence constraints sequentially link tasks in a package. The following table shows the types of
precedence constraints you can use in DTS.
Precedence constraint | Description
On Completion | The destination step runs when the preceding step completes, whether it succeeds or fails.
On Success | The destination step runs only if the preceding step completes successfully.
On Failure | The destination step runs only if the preceding step fails.
The following illustration shows the ActiveX Script task and the Send Mail task with an On Completion
precedence constraint. When the ActiveX Script task completes, with either success or failure, the Send
Mail task runs.
Figure 3: ActiveX Script task and the Send Mail task with an On Completion precedence
constraint
You can configure separate Send Mail tasks, one for an On Success constraint and one for an On Failure
constraint. The two Send Mail tasks can send different messages based on the success or failure of the
ActiveX script.
You also can issue multiple precedence constraints on a task. For example, the Send Mail task "Admin
Notification" could have both an On Success constraint from Script #1 and an On Failure constraint
from Script #2. In these situations, DTS assumes a logical "AND" relationship. Therefore, Script #1
must successfully execute and Script #2 must fail for the Admin Notification message to be sent.
To successfully execute DTS tasks that copy and transform data, a DTS package must establish valid
connections to its source and destination data and to any additional data sources, such as lookup
tables.
When creating a package, you configure connections by selecting a connection type from a list of
available OLE DB providers and ODBC drivers, such as those for Microsoft SQL Server, Microsoft
Access, Microsoft Excel, and other data sources.
DTS allows you to use any OLE DB connection. The icons on the Connections toolbar provide easy
access to common connections.
The following illustration shows a package with two connections. Data is being copied from an Access
database (the source connection) into a SQL Server production database (the destination connection).
The first step in this package is an Execute SQL task, which checks to see if the destination table
already exists. If so, the table is dropped and re-created. On the success of the Execute SQL task, data
is copied to the SQL Server database in Step 2. If the copy operation fails, an e-mail is sent in Step 3.
The DTS data pump is a DTS object that drives the import, export, and transformation of data. The
data pump is used during the execution of the Transform Data, Data Driven Query, and Parallel Data
Pump tasks. These tasks work by creating rowsets on the source and destination connections, then
creating an instance of the data pump to move rows between the source and destination.
In the following illustration, a Transform Data task is used between the Access DB task and the SQL
Production DB task in Step 2. The Transform Data task is the gray arrow between the connections.
To define the data gathered from the source connection, you can build a query for the transformation
tasks. DTS supports parameterized queries, which allow you to define query values when the query is
executed.
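The idea of a parameterized query is not specific to DTS. The following Python/SQLite sketch shows the general pattern of a fixed query text with a value supplied at execution time; the table and data are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "east", 10.0), (2, "west", 25.0), (3, "east", 40.0)])

# The query text is fixed; the region value is supplied at execution time,
# which is the same idea as a DTS parameterized source query.
query = "SELECT id, amount FROM orders WHERE region = ?"
rows = conn.execute(query, ("east",)).fetchall()
```

Because the value is bound as a parameter rather than spliced into the SQL string, the same query can be re-executed with different values (and, as a side benefit, is safe from quoting errors).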
You can type a query into the task's Properties dialog box, or use the Data Transformation Services
Query Designer, a tool for graphically building queries for DTS tasks. In the following illustration, the
Query Designer is used to build a query that joins three tables in the pubs database.
In the transformation tasks, you also define any changes to be made to data. The following table
describes the built-in transformations that DTS provides.
Transformation | Description
You can also create your own custom transformations programmatically. The quickest way to build
custom transformations is to use the Active Template Library (ATL) custom transformation template,
which is included in the SQL Server 2000 DTS sample programs.
A new method of logging transformation errors is available in SQL Server 2000. You can define three
exception log files for use during package execution: an error text file, a source error rows file, and a
destination error rows file.
• If a transformation fails, then the source row is in error, and that row is written to
the source error rows file.
• If an insert fails, then the destination row is in error, and that row is written to the
destination error rows file.
The exception log files are defined in the tasks that transform data. Each transformation task has its
own log files.
By default, the data pump has one phase: row transformation. That phase is what you configure when
mapping column-level transformations in the Transform Data task, Data Driven Query task, and Parallel
Data Pump task, without selecting a phase.
Multiple data pump phases are new in SQL Server 2000. By selecting the multiphase data pump option
in SQL Server Enterprise Manager, you can access the data pump at several points during its operation
and add functionality.
When copying a row of data from source to a destination, the data pump follows the basic process
shown in the following illustration.
After the data pump processes the last row of data, the task is finished and the data pump operation
terminates.
Advanced users who want to add functionality to a package so that it supports any data pump phase
can do so by:
• Writing an ActiveX script phase function for each data pump phase to be
customized. If you use ActiveX script functions to customize data pump phases, no
additional code outside of the package is required.
• Creating a COM object in Microsoft Visual C++® to customize selected data pump
phases. You develop this program external to the package, and the program is
called for each selected phase of the transformation. Unlike the ActiveX script
method of accessing data pump phases, which uses a different function and entry
point for each selected phase, this method provides a single entry point that is
called by multiple data pump phases, while the data pump task executes.
• Save your DTS package to Microsoft SQL Server if you want to store packages on
any instance of SQL Server on your network, keep a convenient inventory of those
packages, and add and delete package versions during the package development
process.
• Save your DTS package to Meta Data Services if you plan to track package version,
meta data, and data lineage information.
• Save your DTS package to a structured storage file if you want to copy, move, and
send a package across the network without having to store the package in a
Microsoft SQL Server database.
• Save your DTS package that has been created by DTS Designer or the DTS
Import/Export Wizard to a Microsoft Visual Basic file if you want to incorporate it
into Visual Basic programs or use it as a prototype for DTS application
development.
The DTS Designer provides a wide variety of solutions to data movement tasks. DTS extends the
number of solutions available by providing programmatic access to the DTS object model. Using
Microsoft Visual Basic, Microsoft Visual C++, or any other application development system that
supports COM, you can develop a custom DTS solution for your environment using functionality
unsupported in the graphical tools.
• Building packages
You can develop extremely complex packages and access the full range of
functionality in the object model, without using the DTS Designer or DTS
Import/Export Wizard.
• Extending packages
You can add new functionality through the construction of custom tasks and
transforms, customized for your business and reusable within DTS.
• Executing packages
DTS packages do not have to be executed from the tools provided; you can
execute them programmatically and display progress through COM events,
allowing the construction of embedded or custom DTS execution
environments.
Sample DTS programs are available to help you get started with DTS programming. The samples can be
installed with SQL Server 2000.
If you develop a DTS application, you can redistribute the DTS files. For more information, see
Redist.txt on the SQL Server 2000 compact disc.
Association Rule:
In data mining, association rule learning is a popular and well-researched method for discovering
interesting relations between variables in large databases. Piatetsky-Shapiro describes analyzing and
presenting strong rules discovered in databases using different measures of interestingness. Based on
the concept of strong rules, association rules can be used to discover regularities between products in
large-scale transaction data. For example, a rule found in the sales data of a supermarket would
indicate that if a customer buys onions and potatoes together, he or she is likely to also buy beef.
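The strength of such a rule is conventionally measured by its support and confidence. A minimal sketch, using a made-up basket dataset built around the onions/potatoes/beef example:

```python
# Measure an association rule X -> Y over market-basket data:
#   support    = fraction of transactions containing both X and Y,
#   confidence = support(X and Y) / support(X).

def rule_metrics(transactions, antecedent, consequent):
    a = set(antecedent)
    both = a | set(consequent)
    n_a = sum(1 for t in transactions if a <= set(t))
    n_both = sum(1 for t in transactions if both <= set(t))
    support = n_both / len(transactions)
    confidence = n_both / n_a if n_a else 0.0
    return support, confidence

# Hypothetical supermarket baskets for illustration.
baskets = [
    ["onions", "potatoes", "beef"],
    ["onions", "potatoes", "beef"],
    ["onions", "bread"],
    ["potatoes", "milk"],
]
s, c = rule_metrics(baskets, ["onions", "potatoes"], ["beef"])
# {onions, potatoes} appears in 2 of 4 baskets and beef appears in both of
# those, so support = 0.5 and confidence = 1.0.
```

A rule is "strong" when both measures exceed user-chosen thresholds (e.g. minimum support and minimum confidence).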
Clustering:
Clustering is the classification of objects into different groups, or more precisely, the partitioning of a
data set into subsets (clusters), so that the data in each subset (ideally) share some common trait -
often proximity according to some defined distance measure. Data clustering is a common technique for
statistical data analysis, which is used in many fields, including machine learning, data mining, pattern
recognition, image analysis and bioinformatics. The computational task of classifying the data set into k
clusters is often referred to as k-clustering.
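The k-clustering task can be illustrated with the classic k-means procedure, one common partitional algorithm. The sketch below works on 1-D data purely for brevity; the same assign/recompute loop generalizes to any dimension.

```python
# Minimal k-means sketch for 1-D data: assign each point to its nearest
# center, then recompute each center as the mean of its assigned points,
# repeating until the centers stabilize.

def kmeans_1d(points, centers, iterations=20):
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # index of the nearest current center
            i = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d([1.0, 1.5, 2.0, 10.0, 11.0, 12.0], [0.0, 5.0])
# converges to centers [1.5, 11.0], splitting the points into the two
# obvious groups
```

With k = 2 and well-separated data, the initial centers quickly move to the means of the two natural groups.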
Types of clustering
Data clustering algorithms can be hierarchical or partitional. Hierarchical algorithms find successive clusters using
previously established clusters. Hierarchical algorithms can be agglomerative ("bottom-up") or divisive
("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them
into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it
into successively smaller clusters.
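The agglomerative ("bottom-up") case can be sketched directly. The example below uses 1-D points and single linkage (the distance between two clusters is the smallest gap between any pair of their members); both choices are just for illustration.

```python
# Agglomerative clustering sketch: start with each point as its own cluster,
# then repeatedly merge the two closest clusters (single linkage) until only
# k clusters remain.

def agglomerative(points, k):
    clusters = [[p] for p in sorted(points)]
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge the closest pair
        del clusters[j]
    return [sorted(c) for c in clusters]

groups = agglomerative([1, 2, 8, 9, 25], k=2)
# merges 1-2 and 8-9 first, then those two groups, leaving [[1, 2, 8, 9], [25]]
```

A divisive algorithm would run the same idea in reverse, starting from one all-inclusive cluster and splitting it.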
Partitional algorithms typically determine all clusters at once, but can also be used as divisive
algorithms in the hierarchical clustering.
Two-way clustering, co-clustering or biclustering are clustering methods where not only the objects are
clustered but also the features of the objects, i.e., if the data is represented in a data matrix, the rows
and columns are clustered simultaneously.
Another important distinction is whether the clustering uses symmetric or asymmetric distances. A
property of Euclidean space is that distances are symmetric (the distance from object A to B is the
same as the distance from B to A). In other applications (e.g., sequence-alignment methods, see
Prinzie & Van den Poel (2006)), this is not the case.
Data classification is the determination of class intervals and class boundaries in the data to be mapped,
and it depends in part on the number of observations. Most maps are designed with 4-6 classes; with
more observations you can choose a larger number of classes, but too many classes are also
undesirable, since they make the map difficult to interpret. There are four classification methods for
making a graduated color or graduated symbol map, and each method produces a different pattern in
the map display.
Natural Breaks Classification
This method is based on subjective decision, and it is the best choice for combining similar values. Because the
class ranges are specific to the individual dataset, it is difficult to compare a map with another map and to
choose the optimum number of classes, especially if the data are evenly distributed.
Quantile Classification
Quantile classification method distributes a set of values into groups that contain an equal number of
values. This method places the same number of data values in each class and will never have empty
classes or classes with too few or too many values. It is attractive in that this method always produces
distinct map patterns.
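The quantile method amounts to cutting the sorted values into groups of (as near as possible) equal size. A minimal sketch:

```python
# Quantile classification sketch: sort the values, then split them into
# n_classes groups containing an equal (or nearly equal, when the counts
# do not divide evenly) number of values each.

def quantile_classes(values, n_classes):
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_classes)
    classes, start = [], 0
    for i in range(n_classes):
        # the first `extra` classes absorb one leftover value each
        end = start + size + (1 if i < extra else 0)
        classes.append(ordered[start:end])
        start = end
    return classes

classes = quantile_classes([7, 1, 9, 3, 5, 2, 8, 6], 4)
# four classes of two values each: [[1, 2], [3, 5], [6, 7], [8, 9]]
```

Note that because membership depends only on rank, similar values can land in different classes and class widths can vary widely, which is the usual caveat attached to quantile maps.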