UNIT : V

PREPARED BY ARUN PRATAP SINGH MTECH 2nd SEMESTER


DESIGN OF DATA WAREHOUSE :
The term "Data Warehouse" was first coined by Bill Inmon in 1990. He said that Data warehouse
is subject Oriented, Integrated, Time-Variant and nonvolatile collection of data.This data helps in
supporting decision making process by analyst in an organization
The operational database undergoes the per day transactions which causes the frequent changes
to the data on daily basis.But if in future the business executive wants to analyse the previous
feedback on any data such as product,supplier,or the consumer data. In this case the analyst will
be having no data available to analyse because the previous data is updated due to transactions.
The Data Warehouses provide us generalized and consolidated data in multidimensional view.
Along with generalize and consolidated view of data the Data Warehouses also provide us Online
Analytical Processing (OLAP) tools. These tools help us in interactive and effective analysis of
data in multidimensional space. This analysis results in data generalization and data mining.
The data mining functions like association,clustering ,classification, prediction can be integrated
with OLAP operations to enhance interactive mining of knowledge at multiple level of abstraction.
That's why data warehouse has now become important platform for data analysis and online
analytical processing.
Understanding Data Warehouse
A data warehouse is a database that is kept separate from the organization's operational
database.
There is no frequent updating done in a data warehouse.
A data warehouse possesses consolidated historical data, which helps the organization analyse
its business.
A data warehouse helps executives organize, understand and use their data to take strategic
decisions.
Data warehouse systems help in the integration of a diversity of application systems.
A data warehouse system allows the analysis of consolidated historical data.
Definition : A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile
collection of data that supports management's decision-making process.
Why a Data Warehouse is Kept Separate from Operational Databases
The following are the reasons why data warehouses are kept separate from operational
databases:
An operational database is constructed for well-known tasks and workloads such as searching
particular records, indexing etc., whereas data warehouse queries are often complex and present
a general form of the data.
Operational databases support the concurrent processing of multiple transactions. Concurrency
control and recovery mechanisms are required for operational databases to ensure robustness
and consistency of the database.
Operational database queries allow read and modify operations, while an OLAP query needs only
read-only access to the stored data.
An operational database maintains current data; a data warehouse, on the other hand, maintains
historical data.
Data Warehouse Features
The key features of a data warehouse, namely Subject Oriented, Integrated, Nonvolatile and Time-
Variant, are discussed below:
Subject Oriented - A data warehouse is subject oriented because it provides information around
a subject rather than the organization's ongoing operations. These subjects can be products,
customers, suppliers, sales, revenue etc. The data warehouse does not focus on the ongoing
operations; rather, it focuses on the modelling and analysis of data for decision making.
Integrated - A data warehouse is constructed by integrating data from heterogeneous sources
such as relational databases, flat files etc. This integration enhances the effective analysis of data.
Time-Variant - The data in a data warehouse is identified with a particular time period. The data
in a data warehouse provides information from a historical point of view.
Non-Volatile - Non-volatile means that previous data is not removed when new data is added to
it. The data warehouse is kept separate from the operational database, therefore frequent
changes in the operational database are not reflected in the data warehouse.
Note: A data warehouse does not require transaction processing, recovery or concurrency
control because it is physically stored separately from the operational database.
Data Warehouse Applications
As discussed before, a data warehouse helps business executives organize, analyse and use
their data for decision making. A data warehouse serves as a sole part of a plan-execute-
assess "closed-loop" feedback system for enterprise management. Data warehouses are widely
used in the following fields:
Financial services
Banking services
Consumer goods
Retail sectors
Controlled manufacturing
Data Warehouse Types
Information processing, analytical processing and data mining are the three types of data
warehouse applications, which are discussed below:

Information processing - A data warehouse allows us to process the information stored in it. The
information can be processed by means of querying, basic statistical analysis, and reporting using
crosstabs, tables, charts, or graphs.
Analytical Processing - A data warehouse supports analytical processing of the information
stored in it. The data can be analyzed by means of basic OLAP operations, including slice-and-
dice, drill-down, drill-up, and pivoting.
Data Mining - Data mining supports knowledge discovery by finding hidden patterns and
associations, constructing analytical models, and performing classification and prediction. These
mining results can be presented using visualization tools.

Data Warehouse (OLAP) vs Operational Database (OLTP):
1. OLAP involves historical processing of information; OLTP involves day-to-day processing.
2. OLAP systems are used by knowledge workers such as executives, managers and analysts; OLTP systems are used by clerks, DBAs or database professionals.
3. OLAP is used to analyse the business; OLTP is used to run the business.
4. OLAP focuses on information out; OLTP focuses on data in.
5. OLAP is based on the star schema, snowflake schema and fact constellation schema; OLTP is based on the entity-relationship model.
6. OLAP is subject oriented; OLTP is application oriented.
7. OLAP contains historical data; OLTP contains current data.
8. OLAP provides summarized and consolidated data; OLTP provides primitive and highly detailed data.
9. OLAP provides a summarized and multidimensional view of data; OLTP provides a detailed and flat relational view of data.
10. The number of OLAP users is in the hundreds; the number of OLTP users is in the thousands.
11. The number of records accessed in OLAP is in the millions; in OLTP it is in the tens.
12. The OLAP database size ranges from 100 GB to TB; the OLTP database size ranges from 100 MB to GB.
13. OLAP systems are highly flexible; OLTP systems provide high performance.

What is Data Warehousing?
Data Warehousing is the process of constructing and using a data warehouse. A data
warehouse is constructed by integrating data from multiple heterogeneous sources. The data
warehouse supports analytical reporting, structured and/or ad hoc queries and decision making.
Data warehousing involves data cleaning, data integration and data consolidation.
Using Data Warehouse Information
There are decision support technologies available which help to utilize the data warehouse. These
technologies help executives use the warehouse quickly and effectively. They can gather the
data, analyse it and take decisions based on the information in the warehouse. The information
gathered from the warehouse can be used in any of the following domains:
Tuning production strategies - Product strategies can be well tuned by repositioning the
products and managing product portfolios by comparing sales quarterly or yearly.
Customer Analysis - Customer analysis is done by analyzing the customer's buying
preferences, buying time, budget cycles etc.
Operations Analysis - Data warehousing also helps in customer relationship management and
making environmental corrections. The information also allows us to analyse business
operations.
In computing, a data warehouse (DW, DWH), or an enterprise data warehouse (EDW), is
a database used for reporting and data analysis. Integrating data from one or more disparate
sources creates a central repository of data, a data warehouse (DW). Data warehouses store
current and historical data and are used for creating trending reports for senior management
reporting such as annual and quarterly comparisons.
The data stored in the warehouse is uploaded from the operational systems (such as marketing,
sales, etc.). The data may pass through an operational data store for additional operations before
it is used in the DW for reporting.




Data warehouses support business decisions by collecting, consolidating, and organizing data for
reporting and analysis with tools such as online analytical processing (OLAP) and data mining.
Although data warehouses are built on relational database technology, the design of a data
warehouse database differs substantially from the design of an online transaction processing
(OLTP) database.

Data Warehouses, OLTP, OLAP, and Data Mining

A relational database is designed for a specific purpose. Because the purpose of a data
warehouse differs from that of an OLTP, the design characteristics of a relational database that
supports a data warehouse differ from the design characteristics of an OLTP database.


A Data Warehouse Supports OLTP-
A data warehouse supports an OLTP system by providing a place for the OLTP database to
offload data as it accumulates, and by providing services that would complicate and degrade
OLTP operations if they were performed in the OLTP database.
Without a data warehouse to hold historical information, data is archived to static media such as
magnetic tape, or allowed to accumulate in the OLTP database.
If data is simply archived for preservation, it is not available or organized for use by analysts and
decision makers. If data is allowed to accumulate in the OLTP so it can be used for analysis, the
OLTP database continues to grow in size and requires more indexes to service analytical and
report queries. These queries access and process large portions of the continually growing
historical data and add a substantial load to the database. The large indexes needed to support
these queries also tax the OLTP transactions with additional index maintenance. These queries
can also be complicated to develop due to the typically complex OLTP database schema.
A data warehouse offloads the historical data from the OLTP, allowing the OLTP to operate at
peak transaction efficiency. High volume analytical and reporting queries are handled by the data
warehouse and do not load the OLTP, which does not need additional indexes for their support.
As data is moved to the data warehouse, it is also reorganized and consolidated so that analytical
queries are simpler and more efficient.

OLAP is a Data Warehouse Tool-
Online analytical processing (OLAP) is a technology designed to provide superior performance
for ad hoc business intelligence queries. OLAP is designed to operate efficiently with data
organized in accordance with the common dimensional model used in data warehouses.
A data warehouse provides a multidimensional view of data in an intuitive model designed to
match the types of queries posed by analysts and decision makers. OLAP organizes data
warehouse data into multidimensional cubes based on this dimensional model, and then
preprocesses these cubes to provide maximum performance for queries that summarize data in
various ways. For example, a query that requests the total sales income and quantity sold for a
range of products in a specific geographical region for a specific time period can typically be
answered in a few seconds or less regardless of how many hundreds of millions of rows of data
are stored in the data warehouse database.
Data warehouse database vs OLTP database:
The data warehouse database is designed for analysis of business measures by categories and attributes; the OLTP database is designed for real-time business operations.
The data warehouse database is optimized for bulk loads and large, complex, unpredictable queries that access many rows per table; the OLTP database is optimized for a common set of transactions, usually adding or retrieving a single row at a time per table.
The data warehouse database is loaded with consistent, valid data and requires no real-time validation; the OLTP database is optimized for validation of incoming data during transactions and uses validation data tables.
The data warehouse database supports few concurrent users relative to OLTP; the OLTP database supports thousands of concurrent users.

OLAP is not designed to store large volumes of text or binary data, nor is it designed to support
high volume update transactions. The inherent stability and consistency of historical data in a data
warehouse enables OLAP to provide its remarkable performance in rapidly summarizing
information for analytical queries.
In SQL Server 2000, Analysis Services provides tools for developing OLAP applications and a
server specifically designed to service OLAP queries.






Data Warehouse Tools and Utilities Functions
The following are the functions of Data Warehouse tools and Utilities:
Data Extraction - Data Extraction involves gathering the data from multiple heterogeneous
sources.
Data Cleaning - Data Cleaning involves finding and correcting the errors in data.
Data Transformation - Data Transformation involves converting data from legacy format to
warehouse format.
Data Loading - Data Loading involves sorting, summarizing, consolidating, checking integrity and
building indices and partitions.
Refreshing - Refreshing involves updating the data from the data sources to the warehouse.
Note: Data Cleaning and Data Transformation are important steps in improving the quality of data
and data mining results.
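As a rough illustration of the cleaning and transformation functions listed above, the following Python sketch (using the pandas library; the column names, values and rules are invented for illustration) standardises formats, fills missing values and removes duplicate records before the data is loaded into the warehouse:

import pandas as pd

# Hypothetical extract from a legacy source system
raw = pd.DataFrame({
    "cust_id": [101, 101, 102, 103],
    "city":    ["new delhi", "New Delhi", "MUMBAI", None],
    "sales":   ["1,200", "1,200", "850", "430"],
})

clean = raw.copy()
clean["city"] = clean["city"].str.title().fillna("Unknown")          # data cleaning: standardise casing, fill missing values
clean["sales"] = clean["sales"].str.replace(",", "").astype(float)   # data transformation: legacy text format -> warehouse numeric format
clean = clean.drop_duplicates()                                      # data cleaning: remove duplicate records
print(clean)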

Data Warehouse :
A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data
that supports management's decision-making process. Let's explore this definition of the data
warehouse.
Subject Oriented - The data warehouse is subject oriented because it provides information
around a subject rather than the organization's ongoing operations. These subjects can be products,
customers, suppliers, sales, revenue etc. The data warehouse does not focus on the ongoing
operations; rather, it focuses on the modelling and analysis of data for decision making.
Integrated - A data warehouse is constructed by integrating data from heterogeneous sources
such as relational databases, flat files etc. This integration enhances the effective analysis of data.
Time-Variant - The data in a data warehouse is identified with a particular time period. The data
in a data warehouse provides information from a historical point of view.
Non-Volatile - Non-volatile means that the previous data is not removed when new data is added
to it. The data warehouse is kept separate from the operational database, therefore frequent
changes in the operational database are not reflected in the data warehouse.
Metadata - Metadata is simply defined as data about data. The data that is used to represent
other data is known as metadata. For example, the index of a book serves as metadata for the
contents of the book. In other words, we can say that metadata is the summarized data that leads
us to the detailed data.
In terms of the data warehouse, we can define metadata as follows:
Metadata is a road map to the data warehouse.
Metadata in a data warehouse defines the warehouse objects.
Metadata acts as a directory. This directory helps the decision support system to locate the
contents of the data warehouse.
Metadata Repository :
The metadata repository is an integral part of a data warehouse system. The metadata
repository contains the following metadata:
Business Metadata - This metadata has the data ownership information, business definition and
changing policies.
Operational Metadata - This metadata includes the currency of data and data lineage. Currency of
data means whether the data is active, archived or purged. Lineage of data means the history of the
data migrated and the transformations applied on it.
Data for mapping from the operational environment to the data warehouse - This metadata includes
source databases and their contents, data extraction, data partitioning, cleaning, transformation
rules, data refresh and purging rules.
The algorithms for summarization - This includes dimension algorithms, data on granularity,
aggregation, summarizing etc.
Data cube :
A data cube helps us represent data in multiple dimensions. The data cube is defined by
dimensions and facts. The dimensions are the entities with respect to which an enterprise keeps
its records.
Illustration of Data cube
Suppose a company wants to keep track of sales records, with the help of a sales data warehouse,
with respect to time, item, branch and location. These dimensions allow it to keep track of monthly
sales and of the branch at which the items were sold. There is a table associated with each dimension,
known as a dimension table. The dimension table further describes the dimensions. For
example, the "item" dimension table may have attributes such as item_name, item_type and
item_brand.
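A rough sketch of such a cube in Python with the pandas library (the dimension values and figures below are invented): a fact table keyed by time, item and location can be pivoted into a 2-D view for one location, or grouped on all three dimensions to form the 3-D cube.

import pandas as pd

# Hypothetical sales fact rows: one measure (units_sold) per time/item/location combination
sales = pd.DataFrame({
    "time":       ["Q1", "Q1", "Q2", "Q2", "Q1"],
    "item":       ["Mobile", "Modem", "Mobile", "Modem", "Mobile"],
    "location":   ["New Delhi", "New Delhi", "New Delhi", "Mumbai", "Mumbai"],
    "units_sold": [605, 825, 680, 512, 400],
})

# 2-D view: sales of one location with respect to the time and item dimensions
view_2d = sales[sales["location"] == "New Delhi"].pivot_table(
    index="time", columns="item", values="units_sold", aggfunc="sum")

# 3-D cube: keep the location dimension as well
cube_3d = sales.groupby(["time", "item", "location"])["units_sold"].sum()

print(view_2d)
print(cube_3d)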
The following table represents the 2-D view of the sales data for a company with respect to the time,
item and location dimensions.

But in this 2-D table we have records with respect to time and item only. The sales for New
Delhi are shown with respect to the time and item dimensions, according to the type of item sold. If we
want to view the sales data with one more dimension, say the location dimension, then the 3-D view of
the sales data with respect to time, item, and location is shown in the table below:



The above 3-D table can be represented as 3-D data cube as shown in the following figure:

DATA MART :
A data mart contains a subset of organization-wide data. This subset of data is valuable to a specific
group of an organization. In other words, we can say that a data mart contains only that data which
is specific to a particular group. For example, the marketing data mart may contain only data
related to items, customers and sales. Data marts are confined to subjects.

Points to remember about data marts:
Windows-based or Unix/Linux-based servers are used to implement data marts. They are
implemented on low-cost servers.
The implementation cycle of a data mart is measured in short periods of time, i.e. in weeks rather
than months or years.
The life cycle of a data mart may be complex in the long run if its planning and design are not
organization-wide.
Data marts are small in size.
Data marts are customized by department.
The source of a data mart is a departmentally structured data warehouse.
Data marts are flexible.
Graphical representation of a data mart:

A data mart is the access layer of the data warehouse environment that is used to get data out to the
users. The data mart is a subset of the data warehouse that is usually oriented to a specific business
line or team. Data marts are small slices of the data warehouse. Whereas data warehouses have an
enterprise-wide depth, the information in data marts pertains to a single department. In some
deployments, each department or business unit is considered the owner of its data mart including all
the hardware, software and data.[1] This enables each department to use, manipulate and develop
their data any way they see fit, without altering information inside other data marts or the data
warehouse. In other deployments where conformed dimensions are used, this business unit ownership
will not hold true for shared dimensions like customer, product, etc.

The reasons why organizations build data warehouses and data marts are that the
information in the database is not organized in a way that makes it easy for organizations to find what
they need. Also, complicated queries might take a long time to answer what people want to know, since
the database systems are designed to process millions of transactions per day. Transactional
databases are designed to be updated; data warehouses or marts, however, are read only. Data
warehouses are designed to access large groups of related records.
Data marts improve end-user response time by allowing users to have access to the specific type of
data they need to view most often by providing the data in a way that supports the collective view of a
group of users.
A data mart is basically a condensed and more focused version of a data warehouse that reflects the
regulations and process specifications of each business unit within an organization. Each data mart is
dedicated to a specific business function or region. This subset of data may span across many or all
of an enterprise's functional subject areas. It is common for multiple data marts to be used in order to
serve the needs of each individual business unit (different data marts can be used to obtain specific
information for various enterprise departments, such as accounting, marketing, sales, etc.).

Reasons for creating a data mart :
Easy access to frequently needed data
Creates collective view by a group of users
Improves end-user response time
Ease of creation
Lower cost than implementing a full data warehouse
Potential users are more clearly defined than in a full data warehouse
Contains only business essential data and is less cluttered.



DEPENDENT DATA MART :
According to the Inmon school of data warehousing, a dependent data mart is a logical subset (view)
or a physical subset (extract) of a larger data warehouse, isolated for one of the following reasons:
A need for a special data model or schema: e.g., to restructure for OLAP
Performance: to offload the data mart to a separate computer for greater efficiency or to obviate
the need to manage that workload on the centralized data warehouse.
Security: to separate an authorized data subset selectively
Expediency: to bypass the data governance and authorizations required to incorporate a new
application on the Enterprise Data Warehouse
Proving Ground: to demonstrate the viability and ROI (return on investment) potential of an
application prior to migrating it to the Enterprise Data Warehouse
Politics: a coping strategy for IT (Information Technology) in situations where a user group has
more influence than funding or is not a good citizen on the centralized data warehouse.
Politics: a coping strategy for consumers of data in situations where a data warehouse team is
unable to create a usable data warehouse.
According to the Inmon school of data warehousing, tradeoffs inherent with data marts include limited
scalability, duplication of data, data inconsistency with other silos of information, and inability to
leverage enterprise sources of data.
The alternative school of data warehousing is that of Ralph Kimball. In his view, a data warehouse is
nothing more than the union of all the data marts. This view helps to reduce costs and provides fast
development, but can create an inconsistent data warehouse, especially in large organizations.
Therefore, Kimball's approach is more suitable for small-to-medium corporations.


Virtual Warehouse :
The view over an operational data warehouse is known as a virtual warehouse. It is easy to build a
virtual warehouse. Building a virtual warehouse requires excess capacity on operational
database servers.

PROCESS FLOW IN DATA WAREHOUSE :
There are four major processes that build a data warehouse. Here is the list of four processes:
Extract and load data.
Cleaning and transforming the data.
Backup and Archive the data.
Managing queries & directing them to the appropriate data sources.

Extract and Load Process
Data extraction takes data from the source systems.
Data load takes the extracted data and loads it into the data warehouse.
Note: Before loading the data into the data warehouse, the information extracted from the external
sources must be reconstructed.
Points to remember during the extract and load process:
Controlling the process
When to Initiate Extract
Loading the Data
CONTROLLING THE PROCESS
Controlling the process involves determining when to start data extraction and running consistency
checks on the data. The controlling process ensures that the tools, the logic modules, and the programs are
executed in the correct sequence and at the correct time.
WHEN TO INITIATE EXTRACT
Data needs to be in a consistent state when it is extracted, i.e. the data warehouse should represent a
single, consistent version of the information to the user.
For example, in a customer profiling data warehouse in the telecommunication sector, it is illogical to
merge the list of customers at 8 pm on Wednesday from a customer database with the customer
subscription events up to 8 pm on Tuesday. This would mean that we are finding the customers
for whom there are no associated subscriptions.
LOADING THE DATA
After extracting the data, it is loaded into a temporary data store. Here, in the temporary data store,
it is cleaned up and made consistent.
Note: Consistency checks are executed only when all the data sources have been loaded into the
temporary data store.
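A minimal sketch of the extract and load step in Python (the source rows, table name and the use of SQLite as the temporary data store are assumptions made for illustration only):

import sqlite3
import pandas as pd

# Extract: take data from a (hypothetical) source system
source_rows = pd.DataFrame({
    "customer_id":   [1, 2, 3],
    "subscribed_on": ["2014-03-01", "2014-03-02", "2014-03-02"],
})

# Load: place the extracted data into a temporary data store (here an SQLite staging database)
staging = sqlite3.connect("staging.db")
source_rows.to_sql("stg_customers", staging, if_exists="replace", index=False)

# Consistency checks would run only after all sources have been loaded into the staging area
row_count = pd.read_sql("SELECT COUNT(*) AS n FROM stg_customers", staging)
print(row_count)
staging.close()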
Clean and Transform Process
Once the data is extracted and loaded into the temporary data store, it is time to perform cleaning
and transformation. Here is the list of steps involved in cleaning and transformation:
Clean and transform the loaded data into a structure.
Partition the data.
Aggregation
CLEAN AND TRANSFORM THE LOADED DATA INTO A STRUCTURE
This will speed up the queries. This can be done in the following ways:
Make sure data is consistent within itself.
Make sure data is consistent with other data within the same data source.
Make sure data is consistent with data in other source systems.
Make sure data is consistent with data already in the warehouse.

Transforming involves converting the source data into a structure. Structuring the data increases
query performance and decreases operational cost. The information in the data warehouse
must be transformed to support the performance requirements of the business and also the ongoing
operational cost.
PARTITION THE DATA
Partitioning will optimize hardware performance and simplify the management of the data warehouse. In
this step we partition each fact table into multiple separate partitions.
AGGREGATION
Aggregation is required to speed up common queries. Aggregation relies on the fact that the most
common queries will analyse a subset or an aggregation of the detailed data.
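The following Python sketch (pandas; the table and figures are invented) illustrates the idea behind partitioning and aggregation: the fact table is split by time period and a summary table is precomputed so that common queries do not have to scan the detailed rows.

import pandas as pd

fact_sales = pd.DataFrame({
    "month":      ["2014-01", "2014-01", "2014-02", "2014-02"],
    "item":       ["Mobile", "Modem", "Mobile", "Modem"],
    "units_sold": [120, 80, 150, 95],
})

# Partition the fact table into multiple separate partitions (one per month)
partitions = {month: rows for month, rows in fact_sales.groupby("month")}

# Aggregation: precompute the monthly totals that common queries ask for
monthly_totals = fact_sales.groupby("month")["units_sold"].sum()
print(monthly_totals)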
Backup and Archive the data
In order to recover the data in the event of data loss, software failure or hardware failure, it is
necessary to back it up on a regular basis. Archiving involves removing the old data from the
system in a format that allows it to be quickly restored whenever required.
For example, in a retail sales analysis data warehouse, it may be required to keep data for 3 years,
with the latest 6 months of data being kept online. In this kind of scenario there is often a requirement to
be able to do month-on-month comparisons for this year and last year. In this case we require
some data to be restored from the archive.
Query Management Process
This process performs the following functions:
It manages the queries.
It speeds up the execution of the queries.
It directs the queries to the most effective data sources.
It should also ensure that all system sources are used in the most effective way.
It is also required to monitor actual query profiles.
Information from this process is used by the warehouse management process to determine which
aggregations to generate.
This process does not generally operate during the regular load of information into the data warehouse.


THREE-TIER DATA WAREHOUSE ARCHITECTURE :
Generally, data warehouses adopt a three-tier architecture. Following are the three tiers of the
data warehouse architecture:
Bottom Tier - The bottom tier of the architecture is the data warehouse database server. It is the
relational database system. We use back-end tools and utilities to feed data into the bottom
tier. These back-end tools and utilities perform the Extract, Clean, Load, and Refresh functions.

Middle Tier - In the middle tier we have the OLAP server. The OLAP server can be implemented in
either of the following ways:
o By relational OLAP (ROLAP), which is an extended relational database management system.
ROLAP maps the operations on multidimensional data to standard relational operations.
o By the multidimensional OLAP (MOLAP) model, which directly implements multidimensional data and
operations.
Top Tier - This tier is the front-end client layer. This layer holds the query tools, reporting tools,
analysis tools and data mining tools.
The following diagram explains the three-tier architecture of the data warehouse:



OLAP :
Introduction
An Online Analytical Processing (OLAP) server is based on the multidimensional data model. It allows
managers and analysts to get an insight into the information through fast, consistent, interactive access
to the information. In this chapter we will discuss the types of OLAP, operations on OLAP, and the
difference between OLAP and statistical databases and OLTP.
Types of OLAP Servers
We have four types of OLAP servers, which are listed below:
Relational OLAP (ROLAP)
Multidimensional OLAP (MOLAP)
Hybrid OLAP (HOLAP)
Specialized SQL Servers


Relational OLAP (ROLAP)
Relational OLAP servers are placed between the relational back-end server and the client front-end
tools. To store and manage warehouse data, relational OLAP uses a relational or extended-
relational DBMS.
ROLAP includes the following:
Implementation of aggregation navigation logic.
Optimization for each DBMS back end.
Additional tools and services.
Multidimensional OLAP (MOLAP)
Multidimensional OLAP (MOLAP) uses array-based multidimensional storage engines for
multidimensional views of data. With multidimensional data stores, the storage utilization may be
low if the data set is sparse. Therefore many MOLAP servers use two levels of data storage
representation to handle dense and sparse data sets.
Hybrid OLAP (HOLAP)
The hybrid OLAP technique is a combination of both ROLAP and MOLAP. It offers both the higher
scalability of ROLAP and the faster computation of MOLAP. A HOLAP server allows storing large
volumes of detailed data, while the aggregations are stored separately in a MOLAP store.

Specialized SQL Servers
Specialized SQL servers provide advanced query language and query processing support for SQL
queries over star and snowflake schemas in a read-only environment.
OLAP Operations
Since the OLAP server is based on the multidimensional view of data, we will discuss the OLAP
operations on multidimensional data.
Here is the list of OLAP operations:
Roll-up
Drill-down
Slice and dice
Pivot (rotate)


ROLL-UP
This operation performs aggregation on a data cube in either of the following ways:
By climbing up a concept hierarchy for a dimension
By dimension reduction
Consider the following diagram showing the roll-up operation.


The roll-up operation is performed by climbing up the concept hierarchy for the dimension location.
Initially the concept hierarchy was "street < city < province < country".
On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the
level of country.
The data is grouped into countries rather than cities.
When the roll-up operation is performed, one or more dimensions from the data cube are
removed.
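A rough pandas sketch of roll-up (the city-to-country mapping and the figures are invented): detailed city-level sales are aggregated up the location hierarchy to the country level.

import pandas as pd

sales = pd.DataFrame({
    "city":       ["Toronto", "Vancouver", "New Delhi", "Mumbai"],
    "country":    ["Canada", "Canada", "India", "India"],
    "units_sold": [395, 605, 825, 512],
})

# Roll-up: climb the concept hierarchy city < country for the location dimension
rolled_up = sales.groupby("country")["units_sold"].sum()
print(rolled_up)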
DRILL-DOWN
The drill-down operation is the reverse of roll-up. This operation is performed in either of the following
ways:
By stepping down a concept hierarchy for a dimension.
By introducing a new dimension.
Consider the following diagram showing the drill-down operation:


The drill-down operation is performed by stepping down the concept hierarchy for the dimension
time.
Initially the concept hierarchy was "day < month < quarter < year".
On drilling down, the time dimension is descended from the level of quarter to the level of month.
When the drill-down operation is performed, one or more dimensions are added to the data
cube.
It navigates the data from less detailed data to highly detailed data.
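A matching pandas sketch of drill-down (figures invented): the summarized quarter-level view is replaced by the more detailed month-level view by stepping down the time hierarchy.

import pandas as pd

# Detailed fact rows kept at the month level
sales = pd.DataFrame({
    "quarter":    ["Q1", "Q1", "Q1", "Q2"],
    "month":      ["January", "February", "March", "April"],
    "units_sold": [200, 180, 225, 210],
})

by_quarter = sales.groupby("quarter")["units_sold"].sum()               # summarized view
drilled_down = sales.groupby(["quarter", "month"])["units_sold"].sum()  # step down to the month level
print(drilled_down)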



SLICE
The slice operation performs a selection on one dimension of a given cube and gives us a new sub-
cube. Consider the following diagram showing the slice operation.


The slice operation is performed for the dimension time using the criterion time = "Q1".
It forms a new sub-cube by selecting one dimension.
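In pandas terms a slice simply fixes one dimension, here time = "Q1" (data invented):

import pandas as pd

cube = pd.DataFrame({
    "time":       ["Q1", "Q1", "Q2"],
    "item":       ["Mobile", "Modem", "Mobile"],
    "location":   ["Toronto", "Vancouver", "Toronto"],
    "units_sold": [605, 825, 680],
})

# Slice: select on a single dimension to obtain a sub-cube
slice_q1 = cube[cube["time"] == "Q1"]
print(slice_q1)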
DICE
The dice operation performs a selection on two or more dimensions of a given cube and gives us a
new sub-cube. Consider the following diagram showing the dice operation:


The dice operation on the cube is based on the following selection criteria, which involve three
dimensions:
(location = "Toronto" or "Vancouver")
(time = "Q1" or "Q2")
(item = "Mobile" or "Modem")
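The same selection criteria can be sketched in pandas by filtering on all three dimensions at once (data invented):

import pandas as pd

cube = pd.DataFrame({
    "location":   ["Toronto", "Vancouver", "New Delhi", "Toronto"],
    "time":       ["Q1", "Q2", "Q1", "Q3"],
    "item":       ["Mobile", "Modem", "Mobile", "Modem"],
    "units_sold": [605, 825, 512, 400],
})

# Dice: select on two or more dimensions to obtain a sub-cube
diced = cube[cube["location"].isin(["Toronto", "Vancouver"])
             & cube["time"].isin(["Q1", "Q2"])
             & cube["item"].isin(["Mobile", "Modem"])]
print(diced)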
PIVOT
The pivot operation is also known as rotation. It rotates the data axes in view in order to provide
an alternative presentation of the data. Consider the following diagram showing the pivot operation.


In this example, the item and location axes of the 2-D slice are rotated.
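A pivot can be sketched in pandas by exchanging the row and column axes of a 2-D slice (data invented):

import pandas as pd

slice_2d = pd.DataFrame({
    "item":       ["Mobile", "Mobile", "Modem", "Modem"],
    "location":   ["Toronto", "Vancouver", "Toronto", "Vancouver"],
    "units_sold": [605, 680, 825, 512],
})

by_item = slice_2d.pivot_table(index="item", columns="location", values="units_sold")
rotated = by_item.T   # pivot (rotate): the item and location axes are exchanged
print(by_item)
print(rotated)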
OLAP vs OLTP
1. OLAP involves historical processing of information; OLTP involves day-to-day processing.
2. OLAP systems are used by knowledge workers such as executives, managers and analysts; OLTP systems are used by clerks, DBAs or database professionals.
3. OLAP is used to analyse the business; OLTP is used to run the business.
4. OLAP focuses on information out; OLTP focuses on data in.
5. OLAP is based on the star schema, snowflake schema and fact constellation schema; OLTP is based on the entity-relationship model.
6. OLAP is subject oriented; OLTP is application oriented.
7. OLAP contains historical data; OLTP contains current data.
8. OLAP provides summarized and consolidated data; OLTP provides primitive and highly detailed data.
9. OLAP provides a summarized and multidimensional view of data; OLTP provides a detailed and flat relational view of data.
10. The number of OLAP users is in the hundreds; the number of OLTP users is in the thousands.
11. The number of records accessed in OLAP is in the millions; in OLTP it is in the tens.
12. The OLAP database size ranges from 100 GB to TB; the OLTP database size ranges from 100 MB to GB.
13. OLAP systems are highly flexible; OLTP systems provide high performance.

CONCEPTUAL MODELING OF DATA WAREHOUSES :
Dimensional modeling is a technique for conceptualizing and visualizing data models as a set of
measures that are described by common aspects of the business. Dimensional modeling has two
basic concepts.
Facts:
A fact is a collection of related data items, consisting of measures.
A fact is a focus of interest for the decision making process.
Measures are continuously valued attributes that describe facts.
A fact is a business measure.
Dimension:
A dimension is a parameter over which we want to perform the analysis of facts.
It is the parameter that gives meaning to a measure; for example, the number of customers is a fact,
and we perform the analysis of it over the time dimension.
Dimensional modeling has also emerged as the only coherent architecture for building distributed
DW systems. We may come up with more complex questions for our warehouse which involve three
or more dimensions.
This is where the multi-dimensional database plays a significant role in analysis. Dimensions are
categories by which summarized data can be viewed. Cubes are data processing units composed
of fact tables and dimensions from the data warehouse.
Multi-Dimensional Modeling
Multidimensional database technology has come a long way since its inception more than 30
years ago. It has recently begun to reach the mass market, with major vendors now delivering
multidimensional engines along with their relational database offerings, often at no extra cost.
Multi-dimensional technology has also made significant gains in scalability and maturity.
The multidimensional data model emerged for use when the objective is to analyze rather than to
perform on-line transactions.
The multidimensional model is based on three key concepts:
Modeling business rules
Cube and measures
Dimensions
Multidimensional database technology is a key factor in the interactive analysis of large amounts
of data for decision-making purposes. The multidimensional data model is introduced based on
relational elements. Dimensions are modeled as dimension relations and are queried with
languages similar to structured query language, although these cannot treat all dimensions and
measures symmetrically. The definition of a multidimensional schema describes multiple levels along a
dimension, and there is at least one key attribute in each level that is included in the keys of the
star schema in relational database systems. Multidimensional databases enable end-users to model
data in a multidimensional environment. This is a real product strength, as it provides the fastest,
most flexible method to process multidimensional requests.


The principal characteristic of a dimensional model is a set of detailed business facts surrounded
by multiple dimensions that describe those facts. When realized in a database, the schema for a
dimensional model contains a central fact table and multiple dimension tables. A dimensional
model may produce a star schema or a snowflake schema.
The schema is a logical description of the entire database. The schema includes the name and
description of records of all record types, including all associated data items and aggregates.
Like a database, the data warehouse also requires a schema. The database uses the
relational model; the data warehouse, on the other hand, uses the star, snowflake and fact
constellation schemas. In this chapter we will discuss the schemas used in a data warehouse.
STAR SCHEMA :
In a star schema each dimension is represented with only one dimension table.
This dimension table contains the set of attributes.
The following diagram shows the sales data of a company with respect to the four
dimensions, namely time, item, branch and location.


There is a fact table at the centre. This fact table contains the keys to each of the four dimensions.
The fact table also contains the attributes, namely dollars sold and units sold.
Note: Each dimension has only one dimension table and each table holds a set of attributes. For
example, the location dimension table contains the attribute set
{location_key, street, city, province_or_state, country}. This constraint may cause data redundancy.
For example, "Vancouver" and "Victoria" are both cities in the Canadian province of British
Columbia. The entries for such cities may cause data redundancy along the attributes
province_or_state and country.

What is a star schema? The star schema architecture is the simplest data warehouse schema. It
is called a star schema because the diagram resembles a star, with points radiating from a
center. The center of the star consists of the fact table and the points of the star are the dimension
tables. Usually the fact tables in a star schema are in third normal form (3NF), whereas
dimensional tables are de-normalized. Despite the fact that the star schema is the simplest
architecture, it is most commonly used nowadays and is recommended by Oracle.

Fact Tables

A fact table typically has two types of columns: foreign keys to dimension tables, and measures,
which contain numeric facts. A fact table can contain fact data at a detailed or aggregated
level.

Dimension Tables

A dimension is a structure usually composed of one or more hierarchies that categorizes data. If
a dimension has no hierarchies and levels, it is called a flat dimension or list. The primary
keys of each of the dimension tables are part of the composite primary key of the fact table.
Dimensional attributes help to describe the dimensional value. They are normally descriptive,
textual values. Dimension tables are generally smaller in size than fact tables.

Typical fact tables store data about sales, while dimension tables store data about geographic
regions (markets, cities), clients, products, times and channels.

The main characteristics of a star schema:
-> easy to understand schema
-> small number of tables to join
-> de-normalization and redundancy of data may cause the size of the tables to be large.
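A small sketch of querying a star schema in Python with pandas (the keys mirror the example above, but the rows and figures are invented): the central fact table is joined to its dimension tables through the dimension keys and then summarised.

import pandas as pd

# Dimension tables (one table per dimension)
dim_item = pd.DataFrame({"item_key": [1, 2], "item_name": ["Mobile", "Modem"]})
dim_location = pd.DataFrame({"location_key": [10, 20],
                             "city": ["Vancouver", "Victoria"],
                             "province_or_state": ["British Columbia", "British Columbia"],
                             "country": ["Canada", "Canada"]})

# Central fact table holds the dimension keys and the measures
fact_sales = pd.DataFrame({"item_key": [1, 2, 1],
                           "location_key": [10, 10, 20],
                           "dollars_sold": [1200.0, 850.0, 990.0],
                           "units_sold": [3, 2, 2]})

# A typical star-schema query: join the fact table to its dimensions, then aggregate
report = (fact_sales.merge(dim_item, on="item_key")
                    .merge(dim_location, on="location_key")
                    .groupby(["country", "item_name"])[["dollars_sold", "units_sold"]].sum())
print(report)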
SNOWFLAKE SCHEMA :
In a snowflake schema some dimension tables are normalized.
The normalization splits the data into additional tables.
Unlike the star schema, the dimension tables in a snowflake schema are normalized. For example, the
item dimension table of the star schema is normalized and split into two dimension tables, namely the
item and supplier tables.

Therefore the item dimension table now contains the attributes item_key, item_name, type, brand,
and supplier_key.

The supplier key is linked to the supplier dimension table. The supplier dimension table contains the
attributes supplier_key and supplier_type.
Note: Due to normalization in the snowflake schema, redundancy is reduced; therefore it
becomes easy to maintain and saves storage space.
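The normalisation described above can be sketched as follows (pandas, invented rows): the item dimension keeps only a supplier_key, the supplier attributes sit in their own table, and reaching them costs one extra join.

import pandas as pd

# Normalised dimension tables of the snowflake schema
dim_item = pd.DataFrame({"item_key": [1, 2],
                         "item_name": ["Mobile", "Modem"],
                         "supplier_key": [100, 200]})
dim_supplier = pd.DataFrame({"supplier_key": [100, 200],
                             "supplier_type": ["wholesale", "retail"]})

fact_sales = pd.DataFrame({"item_key": [1, 2], "units_sold": [3, 5]})

# The extra join through supplier_key is the price paid for reduced redundancy
report = (fact_sales.merge(dim_item, on="item_key")
                    .merge(dim_supplier, on="supplier_key")
                    .groupby("supplier_type")["units_sold"].sum())
print(report)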
FACT CONSTELLATION SCHEMA :

What is a fact constellation schema? For each star schema it is possible to construct a fact
constellation schema (for example by splitting the original star schema into more star
schemas, each of which describes facts at another level of the dimension hierarchies). The fact
constellation architecture contains multiple fact tables that share many dimension tables.

The main shortcoming of the fact constellation schema is a more complicated design because
many variants for particular kinds of aggregation must be considered and selected. Moreover,
dimension tables are still large.
In a fact constellation there are multiple fact tables. This schema is also known as a galaxy schema.
In the following diagram we have two fact tables, namely sales and shipping.

The sales fact table is the same as that in the star schema.

The shipping fact table has five dimensions, namely item_key, time_key, shipper_key, from_location
and to_location.
The shipping fact table also contains two measures, namely dollars sold and units sold.
It is also possible for dimension tables to be shared between fact tables. For example, the time, item and
location dimension tables are shared between the sales and shipping fact tables.
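A sketch of the shared dimensions in Python with pandas (invented rows): both the sales and shipping fact tables of the constellation join to the same item dimension table.

import pandas as pd

dim_item = pd.DataFrame({"item_key": [1, 2], "item_name": ["Mobile", "Modem"]})

fact_sales = pd.DataFrame({"item_key": [1, 2], "units_sold": [3, 5]})
fact_shipping = pd.DataFrame({"item_key": [1, 1], "units_shipped": [2, 1]})

# The item dimension table is shared by both fact tables of the constellation
sold = fact_sales.merge(dim_item, on="item_key").groupby("item_name")["units_sold"].sum()
shipped = fact_shipping.merge(dim_item, on="item_key").groupby("item_name")["units_shipped"].sum()
print(sold)
print(shipped)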


DATA MINING
Data mining is defined as extracting information from huge sets of data. In other words, we
can say that data mining is mining knowledge from data.
Knowledge Discovery and Data Mining (KDD) is an interdisciplinary area focusing upon
methodologies for extracting useful knowledge from data. The ongoing rapid growth of online data
due to the Internet and the widespread use of databases have created an immense need for KDD
methodologies. The challenge of extracting knowledge from data draws upon research in
statistics, databases, pattern recognition, machine learning, data visualization, optimization, and
high-performance computing, to deliver advanced business intelligence and web discovery
solutions.
Introduction
There is huge amount of data available in Information Industry. This data is of no use until
converted into useful information. Analysing this huge amount of data and extracting useful
information from it is necessary.
The extraction of information is not the only process we need to perform, it also involves other
processes such as Data Cleaning, Data Integration, Data Transformation, Data Mining, Pattern
Evaluation and Data Presentation. Once all these processes are over, we are now position to use
this information in many applications such as Fraud Detection, Market Analysis, Production
Control, Science Exploration etc.
What is Data Mining
Data Mining is defined as extracting the information from the huge set of data. In other words we
can say that data mining is mining the knowledge from data. This information can be used for any
of the following applications:
Market Analysis
Fraud Detection
Customer Retention
Production Control
Science Exploration
Need of Data Mining
Here are the reasons listed below:
In the field of information technology we have a huge amount of data available that needs to be turned
into useful information.
This information can further be used for various applications such as market analysis, fraud
detection, customer retention, production control, science exploration etc.
Data Mining Applications
Here is the list of applications of Data Mining:
Market Analysis and Management
Corporate Analysis & Risk Management
Fraud Detection
Data mining deals with the kinds of patterns that can be mined. On the basis of the kind of data to be
mined, there are two kinds of functions involved in data mining, which are listed below:
Descriptive
Classification and Prediction

Classification Criteria:
Classification according to kind of databases mined
Classification according to kind of knowledge mined
Classification according to kinds of techniques utilized
Classification according to applications adapted
CLASSIFICATION ACCORDING TO KIND OF DATABASES MINED
We can classify a data mining system according to the kind of databases mined. Database systems
can be classified according to different criteria such as data models, types of data etc., and the
data mining system can be classified accordingly. For example, if we classify the database
according to the data model, then we may have a relational, transactional, object-relational, or data
warehouse mining system.

CLASSIFICATION ACCORDING TO KIND OF KNOWLEDGE MINED
We can classify a data mining system according to the kind of knowledge mined. This means that data
mining systems are classified on the basis of functionalities such as:
Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Clustering
Outlier Analysis
Evolution Analysis
CLASSIFICATION ACCORDING TO KINDS OF TECHNIQUES UTILIZED
We can classify a data mining system according to the kind of techniques used. We can describe
these techniques according to the degree of user interaction involved or the methods of analysis
employed.
CLASSIFICATION ACCORDING TO APPLICATIONS ADAPTED
We can classify a data mining system according to the applications adapted. These applications are
as follows:
Finance
Telecommunications
DNA
Stock Markets
E-mail

DATA MINING FUNCTIONALITIES :
Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Clustering
Outlier Analysis
Evolution Analysis







DATA MINING SYSTEM CATEGORIZATION AND ITS ISSUES :
Introduction
There is a large variety of data mining systems available. A data mining system may integrate
techniques from the following:
Spatial Data Analysis
Information Retrieval
Pattern Recognition
Image Analysis
Signal Processing
Computer Graphics
Web Technology
Business
Bioinformatics
Data Mining System Classification
The data mining system can be classified according to the following criteria:
Database Technology
Statistics
Machine Learning
Information Science
Visualization
Other Disciplines


Data mining is an interdisciplinary field, the confluence of a set of disciplines, including database
systems, statistics, machine learning, visualization, and information science. Moreover, depending on
the data mining approach used, techniques from other disciplines may be applied, such as neural
networks, fuzzy and/or rough set theory, knowledge representation, inductive logic programming, or
high performance computing. Depending on the kinds of data to be mined or on the given data mining
application, the data mining system may also integrate techniques from spatial data analysis,
information retrieval, pattern recognition, image analysis, signal processing, computer graphics, Web
technology, economics, or psychology.
Because of the diversity of disciplines contributing to data mining, data mining research is expected
to generate a large variety of data mining systems. Therefore, it is necessary to provide a clear
classification of data mining systems. Such a classification may help potential users distinguish data
mining systems and identify those that best match their needs. Data mining systems can be
categorized according to various criteria, as follows.
Classification according to kind of databases mined
Classification according to kind of knowledge mined
Classification according to kinds of techniques utilized
Classification according to applications adapted

CLASSIFICATION ACCORDING TO KIND OF DATABASES MINED :
We can classify a data mining system according to the kind of databases mined. Database systems
can be classified according to different criteria such as data models, types of data etc., and the
data mining system can be classified accordingly. For example, if we classify the database
according to the data model, then we may have a relational, transactional, object-relational, or data
warehouse mining system.
A data mining system can be classified according to the kinds of databases mined. Database
systems themselves can be classified according to different criteria (such as data models, or the
types of data or applications involved), each of which may require its own data mining technique.
Data mining systems can therefore be classified accordingly. For instance, if classifying according
to data models, we may have a relational, transactional, object-oriented, object-relational, or data
warehouse mining system. If classifying according to the special types of data handled, we may
have a spatial, time-series, text, or multimedia data mining system, or a World-Wide Web mining
system. Other system types include heterogeneous data mining systems, and legacy data mining
systems.

CLASSIFICATION ACCORDING TO KIND OF KNOWLEDGE MINED :
We can classify a data mining system according to the kind of knowledge mined. This means that data
mining systems are classified on the basis of functionalities such as:
Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Clustering
Outlier Analysis
Evolution Analysis
Data mining systems can be categorized according to the kinds of knowledge they mine, i.e.,
based on data mining functionalities, such as characterization, discrimination, association,
classification, clustering, trend and evolution analysis, deviation analysis, similarity analysis, etc.
A comprehensive data mining system usually provides multiple and/or integrated data mining
functionalities.

Moreover, data mining systems can also be distinguished based on the granularity or levels of
abstraction of the knowledge mined, including generalized knowledge (at a high level of
abstraction), primitive-level knowledge (at a raw data level), or knowledge at multiple levels
(considering several levels of abstraction). An advanced data mining system should facilitate the
discovery of knowledge at multiple levels of abstraction.
CLASSIFICATION ACCORDING TO KINDS OF TECHNIQUES UTILIZED :
We can classify a data mining system according to the kind of techniques used. We can describe
these techniques according to the degree of user interaction involved or the methods of analysis
employed.
Data mining systems can also be categorized according to the underlying data mining techniques
employed. These techniques can be described according to the degree of user interaction involved
(e.g., autonomous systems, interactive exploratory systems, query-driven systems), or the methods
of data analysis employed (e.g., database-oriented or data warehouse-oriented techniques, machine
learning, statistics, visualization, pattern recognition, neural networks, and so on). A sophisticated data
mining system will often adopt multiple data mining techniques or work out an effective, integrated
technique which combines the merits of a few individual approaches.
CLASSIFICATION ACCORDING TO APPLICATIONS ADAPTED :
We can classify a data mining system according to the applications adapted. These applications are
as follows:
Finance
Telecommunications
DNA
Stock Markets
E-mail

ISSUES IN DATA MINING :
Introduction
Data mining is not that easy. The algorithms used are very complex, and the data is not available at
one place; it needs to be integrated from various heterogeneous data sources. These factors
also create some issues. Here in this tutorial we will discuss the major issues regarding:
Mining Methodology and User Interaction
Performance Issues
Diverse Data Types Issues

The following diagram describes the major issues.

Mining Methodology and User Interaction Issues
It refers to the following kind of issues:
Mining different kinds of knowledge in databases. - The needs of different users are not the
same, and different users may be interested in different kinds of knowledge. Therefore it is
necessary for data mining to cover a broad range of knowledge discovery tasks.
Interactive mining of knowledge at multiple levels of abstraction. - The data mining process
needs to be interactive because it allows users to focus the search for patterns, providing and
refining data mining requests based on returned results.
Incorporation of background knowledge. - Background knowledge can be used to guide the
discovery process and to express the discovered patterns. Background knowledge may be
used to express the discovered patterns not only in concise terms but at multiple levels of
abstraction.
Data mining query languages and ad hoc data mining. - Data Mining Query language that
allows the user to describe ad hoc mining tasks, should be integrated with a data warehouse
query language and optimized for efficient and flexible data mining.
Presentation and visualization of data mining results. - Once the patterns are discovered, they
need to be expressed in high-level languages and visual representations. These representations
should be easily understandable by the users.
Handling noisy or incomplete data. - Data cleaning methods are required that can handle the
noise and incomplete objects while mining the data regularities. If data cleaning methods are not
there, then the accuracy of the discovered patterns will be poor.
Pattern evaluation. - This refers to the interestingness of the discovered patterns. Some patterns may
turn out to be uninteresting because they represent common knowledge or lack novelty.

Performance Issues
It refers to the following issues:
Efficiency and scalability of data mining algorithms. - In order to effectively extract
information from the huge amount of data in databases, data mining algorithms must be efficient and
scalable.
Parallel, distributed, and incremental mining algorithms. - Factors such as the huge size of
databases, the wide distribution of data, and the complexity of data mining methods motivate the
development of parallel and distributed data mining algorithms. These algorithms divide the data
into partitions which are processed in parallel. The results from the partitions are then merged.
Incremental algorithms update databases without having to mine the data again from scratch.
Diverse Data Types Issues
Handling of relational and complex types of data. - The database may contain complex data
objects, multimedia data objects, spatial data, temporal data etc. It is not possible for one system
to mine all these kinds of data.
Mining information from heterogeneous databases and global information systems. - The
data is available at different data sources on a LAN or WAN. These data sources may be structured,
semi-structured or unstructured. Therefore mining knowledge from them adds challenges to data
mining.

OTHER ISSUES IN DATA MINING :
Some of these issues are addressed below. Note that these issues are not exclusive and are not
ordered in any way.
Security and social issues: Security is an important issue with any data collection that is shared
and/or is intended to be used for strategic decision-making. In addition, when data is collected for
customer profiling, user behavior understanding, correlating personal data with other information,
etc., large amounts of sensitive and private information about individuals or companies are
gathered and stored. This becomes controversial given the confidential nature of some of this
data and the potential illegal access to the information. Moreover, data mining could disclose new
implicit knowledge about individuals or groups that could be against privacy policies, especially if
there is potential dissemination of discovered information. Another issue that arises from this
concern is the appropriate use of data mining. Due to the value of data, databases of all sorts of
content are regularly sold, and because of the competitive advantage that can be attained from
implicit knowledge discovered, some important information could be withheld, while other
information could be widely distributed and used without control.

User interface issues: The knowledge discovered by data mining tools is useful as long as it is
interesting, and above all understandable by the user. Good data visualization eases the
interpretation of data mining results, as well as helps users better understand their needs. Many
data exploratory analysis tasks are significantly facilitated by the ability to see data in an
appropriate visual presentation. There are many visualization ideas and proposals for effective
data graphical presentation. However, there is still much research to accomplish in order to obtain
good visualization tools for large datasets that could be used to display and manipulate mined

knowledge. The major issues related to user interfaces and visualization are "screen real-
estate", information rendering, and interaction. Interactivity with the data and data mining results
is crucial since it provides means for the user to focus and refine the mining tasks, as well as to
picture the discovered knowledge from different angles and at different conceptual levels.

Mining methodology issues: These issues pertain to the data mining approaches applied and
their limitations. Topics such as versatility of the mining approaches, the diversity of data
available, the dimensionality of the domain, the broad analysis needs (when known), the
assessment of the knowledge discovered, the exploitation of background knowledge and
metadata, the control and handling of noise in data, etc. are all examples that can dictate mining
methodology choices. For instance, it is often desirable to have different data mining methods
available since different approaches may perform differently depending upon the data at hand.
Moreover, different approaches may suit and solve user's needs differently.
Most algorithms assume the data to be noise-free. This is of course a strong assumption. Most
datasets contain exceptions, invalid or incomplete information, etc., which may complicate, if not
obscure, the analysis process and in many cases compromise the accuracy of the results. As a
consequence, data preprocessing (data cleaning and transformation) becomes vital. It is often
seen as lost time, but data cleaning, as time-consuming and frustrating as it may be, is one of the
most important phases in the knowledge discovery process. Data mining techniques should be
able to handle noise in data or incomplete information.
More than the size of the data, the size of the search space is decisive for data mining techniques. The size of the search space often depends upon the number of dimensions in the domain space, and it usually grows exponentially as the number of dimensions increases. This is known as the curse of dimensionality. This "curse" affects the performance of some data mining approaches so badly that it has become one of the most urgent issues to solve.

Performance issues: Many artificial intelligence and statistical methods exist for data analysis
and interpretation. However, these methods were often not designed for the very large data sets
data mining is dealing with today. Terabyte sizes are common. This raises the issues of scalability
and efficiency of the data mining methods when processing considerably large data. Algorithms
with exponential and even medium-order polynomial complexity cannot be of practical use for
data mining. Linear algorithms are usually the norm. In the same vein, sampling can be used for
mining instead of the whole dataset. However, concerns such as completeness and choice of
samples may arise. Other topics in the issue of performance are incremental updating, and
parallel programming. There is no doubt that parallelism can help solve the size problem if the
dataset can be subdivided and the results can be merged later. Incremental updating is important
for merging results from parallel mining, or updating data mining results when new data becomes
available without having to re-analyze the complete dataset.

Data source issues: There are many issues related to the data sources, some are practical such
as the diversity of data types, while others are philosophical like the data glut problem. We

certainly have an excess of data since we already have more data than we can handle and we
are still collecting data at an even higher rate. If the spread of database management systems
has helped increase the gathering of information, the advent of data mining is certainly
encouraging more data harvesting. The current practice is to collect as much data as possible
now and process it, or try to process it, later. The concern is whether we are collecting the right
data at the appropriate amount, whether we know what we want to do with it, and whether we
distinguish between what data is important and what data is insignificant. Regarding the practical
issues related to data sources, there is the subject of heterogeneous databases and the focus on
diverse complex data types. We are storing different types of data in a variety of repositories. It is
difficult to expect a data mining system to effectively and efficiently achieve good mining results
on all kinds of data and sources. Different kinds of data and sources may require distinct
algorithms and methodologies. Currently, there is a focus on relational databases and data
warehouses, but other approaches need to be pioneered for other specific complex data types. A
versatile data mining tool, for all sorts of data, may not be realistic. Moreover, the proliferation of
heterogeneous data sources, at structural and semantic levels, poses important challenges not
only to the database community but also to the data mining community.

DATA PROCESSING :
What is the need for Data Processing?
To get the required information from huge, incomplete, noisy and inconsistent sets of data, it is necessary to use data processing.
Steps in Data Processing:
Data Cleaning
Data Integration
Data Transformation
Data reduction
Data Summarization
What is Data Cleaning?
Data cleaning is a procedure to clean the data by filling in missing values, smoothing noisy data,
identifying or removing outliers, and resolving inconsistencies
What is Data Integration?
Integrating multiple databases, data cubes, or files, this is called data integration.
What is Data Transformation?
Data transformation operations, such as normalization and aggregation, are additional data
preprocessing procedures that would contribute toward the success of the mining process.

What is Data Reduction?
Data reduction obtains a reduced representation of the data set that is much smaller in volume,
yet produces the same (or almost the same) analytical results.
What is Data Summarization?
It is the process of representing the collected data in an accurate and compact way without losing information; it also involves deriving summary information from the collected data. For example, display the data as a graph and compute the mean, median, mode, etc.
How to Clean Data?
Handling Missing values
Ignore the tuple
Fill in the missing value manually
Use a global constant to fill in the missing value
Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the same class as the given tuple
Use the most probable value to fill in the missing value.
Handle Noisy Data
Binning: Binning methods smooth a sorted data value by consulting its neighborhood.
Regression: Data can be smoothed by fitting the data to a function, such as with regression.
Clustering: Outliers may be detected by clustering, where similar values are organized into groups, or clusters.
A short sketch of missing-value handling and binning is given below.
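
A minimal sketch of these cleaning steps, assuming pandas is available; the column names and values are hypothetical.

    import pandas as pd

    df = pd.DataFrame({"age": [23, None, 45, 31, None, 52],
                       "income": [40, 42, 300, 41, 43, 45]})

    # Fill in missing values with the attribute mean (one of the options listed above).
    df["age"] = df["age"].fillna(df["age"].mean())

    # Smooth the noisy 'income' attribute by binning: sort values into equal-width bins
    # and replace each value by its bin mean.
    bins = pd.cut(df["income"], bins=3)
    df["income_smoothed"] = df.groupby(bins)["income"].transform("mean")

    print(df)
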
Data Integration :
Data Integration combines data from multiple sources into a coherent data store, as in data
warehousing. These sources may include multiple databases, data cubes, or flat files. Issues that arise during data integration include schema integration and object matching; redundancy is another important issue.
Data Transformation
Data transformation can be achieved in following ways
Smoothing: which works to remove noise from the data

Aggregation: where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute weekly and annual totals.
Generalization of the data: where low-level or primitive (raw) data are replaced by higher-level concepts through the use of concept hierarchies. For example, categorical attributes, like street, can be generalized to higher-level concepts, like city or country.
Normalization: where the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0.
Attribute construction: this is where new attributes are constructed and added from the given set of attributes to help the mining process. A short sketch of aggregation and normalization is given after this list.
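
A minimal sketch of aggregation and min-max normalization, assuming pandas is available; the table and column names are hypothetical.

    import pandas as pd

    sales = pd.DataFrame({"store": ["A", "A", "B", "B"],
                          "day": ["Mon", "Tue", "Mon", "Tue"],
                          "amount": [200.0, 150.0, 900.0, 1100.0]})

    # Aggregation: roll daily amounts up to a per-store total.
    totals = sales.groupby("store", as_index=False)["amount"].sum()

    # Normalization: min-max scale the totals into the range 0.0 to 1.0.
    lo, hi = totals["amount"].min(), totals["amount"].max()
    totals["amount_scaled"] = (totals["amount"] - lo) / (hi - lo)

    print(totals)
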
Data Reduction techniques
These are the techniques that can be applied to obtain a reduced representation of the data set
that is much smaller in volume, yet closely maintains the integrity of the original data.
1) Data cube aggregation
2) Attribute subset selection
3) Dimensionality reduction
4) Numerosity reduction
5) Discretization and concept hierarchy generation





DATA REDUCTION :
What is Data Reduction?
Data reduction obtains a reduced representation of the data set that is much smaller in volume,
yet produces the same (or almost the same) analytical results.
Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results. A small sketch of one such technique (simple random sampling for numerosity reduction) is given after the list below.
Data Reduction techniques
These are the techniques that can be applied to obtain a reduced representation of the data set
that is much smaller in volume, yet closely maintains the integrity of the original data.
1) Data cube aggregation
2) Attribute subset selection
3) Dimensionality reduction
4) Numerosity reduction
5) Discretization and concept hierarchy generation
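
A minimal sketch of one data reduction technique, numerosity reduction by simple random sampling, using only the Python standard library; the sampling fraction and data are illustrative.

    import random

    def sample_without_replacement(records, fraction=0.1, seed=42):
        # Numerosity reduction: keep a simple random sample of the records.
        random.seed(seed)
        k = max(1, int(len(records) * fraction))
        return random.sample(records, k)

    data = list(range(1000))          # stand-in for a large data set
    reduced = sample_without_replacement(data, fraction=0.05)
    print(len(reduced), reduced[:10])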

DATA MINING STATISTICS :






DATA MINING TECHNIQUES :
Many different data mining, query model, processing model, and data collection techniques are
available. Which one do you use to mine your data, and which one can you use in combination
with your existing software and infrastructure? Examine different data mining and analytics
techniques and solutions, and learn how to build them using existing software and installations.
Explore the different data mining tools that are available, and learn how to determine whether the
size and complexity of your information might result in processing and storage complexities, and
what to do.


This overview provides a description of some of the most common data mining algorithms in
use today. We have broken the discussion into two sections, each with a specific theme:
Classical Techniques: Statistics, Neighborhoods and Clustering
Next Generation Techniques: Trees, Networks and Rules
I. Classical Techniques: Statistics, Neighborhoods and Clustering
1.1. The Classics
These two sections have been broken up based on when the data mining technique was
developed and when it became technically mature enough to be used for business, especially for
aiding in the optimization of customer relationship management systems. Thus this section contains descriptions of techniques that have classically been used for decades; the next section covers techniques that have only been widely used since the early 1980s.
This section should help the user to understand the rough differences between the techniques and provide at least enough information to be well armed enough not to be baffled by the vendors of different data mining tools.
The main techniques that we will discuss here are the ones that are used 99.9% of the time on
existing business problems. There are certainly many other ones as well as proprietary
techniques from particular vendors - but in general the industry is converging to those techniques
that work consistently and are understandable and explainable.
1.2. Statistics
By strict definition "statistics" or statistical techniques are not data mining. They were being used
long before the term data mining was coined to apply to business applications. However,
statistical techniques are driven by the data and are used to discover patterns and build predictive
models. From the user's perspective you will be faced with a conscious choice when solving
a "data mining" problem as to whether you wish to attack it with statistical methods or other data
mining techniques. For this reason it is important to have some idea of how statistical techniques
work and how they can be applied.
What is different between statistics and data mining?
I flew the Boston to Newark shuttle recently and sat next to a professor from one of the Boston area universities. He was going to discuss the drosophila (fruit fly) genetic makeup with a
pharmaceutical company in New Jersey. He had compiled the world's largest database on the
genetic makeup of the fruit fly and had made it available to other researchers on the internet
through Java applications accessing a larger relational database.
He explained to me that they not only now were storing the information on the flies but also were
doing "data mining" adding as an aside "which seems to be very important these days whatever
that is". I mentioned that I had written a book on the subject and he was interested in knowing
what the difference was between "data mining" and statistics. There was no easy answer.
The techniques used in data mining, when successful, are successful for precisely the same
reasons that statistical techniques are successful (e.g. clean data, a well defined target to predict

and good validation to avoid overfitting). And for the most part the techniques are used in the
same places for the same types of problems (prediction, classification, discovery). In fact some of the techniques that are classically defined as "data mining", such as CART and CHAID, arose from statisticians.
So what is the difference? Why aren't we as excited about "statistics" as we are about data
mining? There are several reasons. The first is that the classical data mining techniques such
as CART, neural networks and nearest neighbor techniques tend to be more robust to both
messier real world data and also more robust to being used by less expert users. But that is not
the only reason. The other reason is that the time is right. Because of the use of computers for closed-loop business data storage and generation, there now exist large quantities of data available to users. If there were no data, there would be no interest in mining it. Likewise the
fact that computer hardware has dramatically upped the ante by several orders of magnitude in
storing and processing the data makes some of the most powerful data mining techniques feasible
today.
1.3. Nearest Neighbor
Clustering and the Nearest Neighbor prediction technique are among the oldest techniques used
in data mining. Most people have an intuition that they understand what clustering is - namely
that like records are grouped or clustered together. Nearest neighbor is a prediction technique
that is quite similar to clustering - its essence is that in order to predict the prediction value for one record, you look for records with similar predictor values in the historical database and use the prediction value from the record that is nearest to the unclassified record.
A simple example of clustering
A simple example of clustering would be the clustering that most people perform when they do
the laundry - grouping the permanent press, dry cleaning, whites and brightly colored clothes is
important because they have similar characteristics. And it turns out they have important
attributes in common about the way they behave (and can be ruined) in the wash. To cluster
your laundry most of your decisions are relatively straightforward. There are of course difficult
decisions to be made about which cluster your white shirt with red stripes goes into (since it is
mostly white but has some color and is permanent press). When clustering is used in business
the clusters are often much more dynamic - even changing weekly to monthly and many more of
the decisions concerning which cluster a record falls into can be difficult.
A simple example of nearest neighbor
A simple example of the nearest neighbor prediction algorithm is to look at the people in your neighborhood (in this case those people that are in fact geographically near to you). You
may notice that, in general, you all have somewhat similar incomes. Thus if your neighbor has
an income greater than $100,000 chances are good that you too have a high income. Certainly
the chances that you have a high income are greater when all of your neighbors have incomes
over $100,000 than if all of your neighbors have incomes of $20,000. Within your neighborhood
there may still be a wide variety of incomes possible among even your closest neighbors but if
you had to predict someone's income based only on knowing their neighbors, your best chance
of being right would be to predict the incomes of the neighbors who live closest to the unknown
person.

The nearest neighbor prediction algorithm works in very much the same way except that
nearness in a database may consist of a variety of factors not just where the person lives. It
may, for instance, be far more important to know which school someone attended and what
degree they attained when predicting income. The better definition of near might in fact be other
people that you graduated from college with rather than the people that you live next to.
Nearest Neighbor techniques are among the easiest to use and understand because they work
in a way similar to the way that people think - by detecting closely matching examples. They also
perform quite well in terms of automation, as many of the algorithms are robust with respect to
dirty data and missing data. Lastly they are particularly adept at performing complex ROI
calculations because the predictions are made at a local level where business simulations could
be performed in order to optimize ROI. As they enjoy similar levels of accuracy compared to
other techniques the measures of accuracy such as lift are as good as from any other.
How to use Nearest Neighbor for Prediction
One of the essential elements underlying the concept of clustering is that one particular object
(whether they be cars, food or customers) can be closer to another object than can some third
object. It is interesting that most people have an innate sense of ordering placed on a variety of
different objects. Most people would agree that an apple is closer to an orange than it is to a
tomato and that a Toyota Corolla is closer to a Honda Civic than to a Porsche. This sense of
ordering on many different objects helps us place them in time and space and to make sense of
the world. It is what allows us to build clusters - both in databases on computers as well as in our
daily lives. This definition of nearness that seems to be ubiquitous also allows us to make
predictions.
The nearest neighbor prediction algorithm simply stated is:
Objects that are near to each other will have similar prediction values as well. Thus if you know
the prediction value of one of the objects you can predict it for its nearest neighbors.
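
A minimal sketch of this idea in Python, assuming numeric predictor values and Euclidean distance; the records and labels are hypothetical.

    import math

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def nearest_neighbor_predict(history, new_record):
        # history: list of (predictor_values, prediction_value) pairs.
        # Return the prediction value of the historical record nearest to new_record.
        nearest = min(history, key=lambda pair: euclidean(pair[0], new_record))
        return nearest[1]

    history = [((25, 30_000), "low"), ((47, 120_000), "high"), ((52, 95_000), "high")]
    print(nearest_neighbor_predict(history, (45, 110_000)))   # -> "high"
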
Where has the nearest neighbor technique been used in business?
One of the classical places that nearest neighbor has been used for prediction has been in text
retrieval. The problem to be solved in text retrieval is one where the end user defines a document
(e.g. Wall Street Journal article, technical conference paper etc.) that is interesting to them and
they solicit the system to find more documents like this one. Effectively defining a target of: this
is the interesting document or this is not interesting. The prediction problem is that only a very
few of the documents in the database actually have values for this prediction field (namely only
the documents that the reader has had a chance to look at so far). The nearest neighbor
technique is used to find other documents that share important characteristics with those
documents that have been marked as interesting.
Using nearest neighbor for stock market data
As with almost all prediction algorithms, nearest neighbor can be used in a variety of places. Its
successful use is mostly dependent on the pre-formatting of the data so that nearness can be
calculated and where individual records can be defined. In the text retrieval example this was not
too difficult - the objects were documents. This is not always as easy as it is for text retrieval.
Consider what it might be like in a time series problem - say for predicting the stock market. In

this case the input data is just a long series of stock prices over time without any particular record
that could be considered to be an object. The value to be predicted is just the next value of the
stock price.
The way that this problem is solved for both nearest neighbor techniques and for some other
types of prediction algorithms is to create training records by taking, for instance, 10 consecutive
stock prices and using the first 9 as predictor values and the 10th as the prediction value. Doing
things this way, if you had 100 data points in your time series you could create 10 different training
records.
You could create even more training records than 10 by creating a new record starting at every data point. For instance, you could take the first 10 data points and create a record; then you could take the 10 consecutive data points starting at the second data point, then the 10 consecutive data points starting at the third data point, and so on. Even though some of the data points would overlap from one record to the next, the prediction value would always be different. In our example of 100 initial data points, 91 different training records could be created this way as opposed to the 10 training records created via the other method.
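
A short sketch of this sliding-window construction, assuming the series is a plain Python list; the window size of 10 follows the example above.

    def make_training_records(prices, window=10):
        # Slide a window over the series: the first window-1 values are the predictors,
        # the last value in the window is the prediction target.
        records = []
        for start in range(len(prices) - window + 1):
            chunk = prices[start:start + window]
            records.append((chunk[:-1], chunk[-1]))
        return records

    prices = list(range(100))                 # stand-in for 100 stock prices
    records = make_training_records(prices)
    print(len(records))                       # 91 overlapping records from 100 points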
Why voting is better - K Nearest Neighbors
One of the improvements that is usually made to the basic nearest neighbor algorithm is to take
a vote from the K nearest neighbors rather than just relying on the sole nearest neighbor to the
unclassified record. In Figure 1.4 we can see that unclassified example C has a nearest neighbor
that is a defaulter and yet is surrounded almost exclusively by records that are good credit risks. In
this case the nearest neighbor to record C is probably an outlier - which may be incorrect data or
some non-repeatable idiosyncrasy. In either case it is more than likely that C is a non-defaulter
yet would be predicted to be a defaulter if the sole nearest neighbor were used for the prediction.

Figure 1.4 The nearest neighbors are shown graphically for three unclassified records: A, B, and
C.
In cases like these a vote of the 9 or 15 nearest neighbors would provide a better prediction
accuracy for the system than would just the single nearest neighbor. Usually this is accomplished

by simply taking the majority or plurality of predictions from the K nearest neighbors if the prediction column is binary or categorical, or by taking the average value of the prediction column from the K nearest neighbors if it is numeric.
How can the nearest neighbor tell you how confident it is in the prediction?
Another important aspect of any system that is used to make predictions is that the user be
provided with, not only the prediction, but also some sense of the confidence in that prediction
(e.g. the prediction is defaulter with the chance of being correct 60% of the time). The nearest
neighbor algorithm provides this confidence information in a number of ways:
The distance to the nearest neighbor provides a level of confidence. If the neighbor is very close
or an exact match then there is much higher confidence in the prediction than if the nearest record
is a great distance from the unclassified record.
The degree of homogeneity amongst the predictions within the K nearest neighbors can also be
used. If all the nearest neighbors make the same prediction then there is much higher confidence
in the prediction than if half the records made one prediction and the other half made another
prediction.
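
A minimal sketch of K nearest neighbor voting with a simple agreement-based confidence, assuming numeric predictors and Euclidean distance; the records, labels and choice of K are illustrative.

    from collections import Counter
    import math

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def knn_predict(history, new_record, k=9):
        # history: list of (predictor_values, class_label) pairs.
        neighbors = sorted(history, key=lambda pair: euclidean(pair[0], new_record))[:k]
        votes = Counter(label for _, label in neighbors)
        label, count = votes.most_common(1)[0]
        confidence = count / len(neighbors)   # degree of agreement among the K neighbors
        return label, confidence

    history = [((1, 1), "good"), ((1, 2), "good"), ((2, 1), "good"),
               ((2, 2), "defaulter"), ((9, 9), "defaulter")]
    print(knn_predict(history, (1.5, 1.5), k=3))   # e.g. ("good", 1.0)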
1.4. Clustering
Clustering for Clarity
Clustering is the method by which like records are grouped together. Usually this is done to give
the end user a high level view of what is going on in the database. Clustering is sometimes used
to mean segmentation - which most marketing people will tell you is useful for coming up with a bird's-eye view of the business. Two well-known clustering systems are the PRIZM system from Claritas Corporation and MicroVision from Equifax Corporation. These companies have
grouped the population by demographic information into segments that they believe are useful for
direct marketing and sales. To build these groupings they use information such as income, age,
occupation, housing and race collected in the US Census. Then they assign memorable
nicknames to the clusters. Some examples are shown in Table 1.2.
Name                 | Income      | Age                 | Education      | Vendor
Blue Blood Estates   | Wealthy     | 35-54               | College        | Claritas Prizm
Shotguns and Pickups | Middle      | 35-64               | High School    | Claritas Prizm
Southside City       | Poor        | Mix                 | Grade School   | Claritas Prizm
Living Off the Land  | Middle-Poor | School Age Families | Low            | Equifax MicroVision
University USA       | Very low    | Young - Mix         | Medium to High | Equifax MicroVision
Sunset Years         | Medium      | Seniors             | Medium         | Equifax MicroVision
Table 1.2 Some Commercially Available Cluster Tags

This clustering information is then used by the end user to tag the customers in their
database. Once this is done the business user can get a quick high level view of what is
happening within the cluster. Once the business user has worked with these codes for some time
they also begin to build intuitions about how these different customer clusters will react to the marketing offers particular to their business. For instance some of these clusters may relate to their business and some of them may not. But given that their competition may well be using these same clusters to structure their business and marketing offers, it is important to be aware of how your customer base behaves in regard to these clusters.
Finding the ones that don't fit in - Clustering for Outliers
Sometimes clustering is performed not so much to keep records together as to make it easier to
see when one record sticks out from the rest. For instance:
Most wine distributors selling inexpensive wine in Missouri that ship a certain volume of
product produce a certain level of profit. There is a cluster of stores that can be formed with these
characteristics. One store stands out, however, as producing significantly lower profit. On closer
examination it turns out that the distributor was delivering product to but not collecting payment
from one of their customers.
A sale on men's suits is being held in all branches of a department store for southern California. All stores with these characteristics have seen at least a 100% jump in revenue since the start of the sale except one. It turns out that this store had, unlike the others, advertised via radio rather than television.
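
A minimal sketch of flagging the record that sticks out from its cluster, using a simple deviation-from-the-cluster-mean rule; the store names, profit figures and threshold are hypothetical, and a real system would form the clusters first.

    def flag_outliers(cluster_records, threshold=1.5):
        # cluster_records: list of (store_name, profit) for stores that fall in one cluster.
        # The threshold is arbitrary; it controls how far from the cluster mean is "unusual".
        profits = [p for _, p in cluster_records]
        mean = sum(profits) / len(profits)
        std = (sum((p - mean) ** 2 for p in profits) / len(profits)) ** 0.5
        return [name for name, p in cluster_records
                if std > 0 and abs(p - mean) > threshold * std]

    stores = [("Store 1", 52_000), ("Store 2", 49_000), ("Store 3", 51_000),
              ("Store 4", 50_500), ("Store 5", 5_000)]   # Store 5 sticks out
    print(flag_outliers(stores))                          # -> ['Store 5']
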
How is clustering like the nearest neighbor technique?
The nearest neighbor algorithm is basically a refinement of clustering in the sense that they both
use distance in some feature space to create either structure in the data or predictions. The
nearest neighbor algorithm is a refinement since part of the algorithm usually is a way of
automatically determining the weighting of the importance of the predictors and how the distance
will be measured within the feature space. Clustering is one special case of this where the
importance of each predictor is considered to be equivalent.
Nearest Neighbor versus Clustering:
Purpose: Nearest neighbor is used for prediction as well as consolidation; clustering is used mostly for consolidating data into a high-level view and general grouping of records into like behaviors.
Space: For nearest neighbor, the space is defined by the problem to be solved (supervised learning); for clustering, the space is a default n-dimensional space, is defined by the user, or is a predefined space driven by past experience (unsupervised learning).
Nearness: Nearest neighbor generally only uses distance metrics to determine nearness; clustering can use other metrics besides distance to determine the nearness of two records - for example, linking two points together.



II. Next Generation Techniques: Trees, Networks and Rules
2.1. The Next Generation
The data mining techniques in this section represent the most often used techniques that have
been developed over the last two decades of research. They also represent the vast majority of
the techniques that are being spoken about when data mining is mentioned in the popular
press. These techniques can be used for either discovering new information within large
databases or for building predictive models. Though the older decision tree techniques such as CHAID are currently widely used, newer techniques such as CART are gaining wider acceptance.
Decision trees
Related to most of the other techniques (primarily classification and prediction), the decision tree
can be used either as a part of the selection criteria, or to support the use and selection of specific
data within the overall structure. Within the decision tree, you start with a simple question that has
two (or sometimes more) answers. Each answer leads to a further question to help classify or
identify the data so that it can be categorized, or so that a prediction can be made based on each
answer.
Figure 4 shows an example where you can classify an incoming error condition.
Figure 4. Decision tree


Decision trees are often used with classification systems to attribute type information, and with
predictive systems, where different predictions might be based on past historical experience that
helps drive the structure of the decision tree and the output.
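
A minimal sketch of building and inspecting a small decision tree, assuming scikit-learn is available; the training data, labels and feature names are hypothetical.

    from sklearn.tree import DecisionTreeClassifier, export_text

    # Hypothetical training data: [age, income] -> credit risk label.
    X = [[25, 30_000], [47, 120_000], [52, 95_000], [31, 40_000], [60, 20_000], [38, 85_000]]
    y = ["high", "low", "low", "high", "high", "low"]

    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    print(export_text(tree, feature_names=["age", "income"]))   # the learned questions
    print(tree.predict([[45, 100_000]]))                        # e.g. ['low']
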

Neural Networks : What is a Neural Network?
When data mining algorithms are talked about these days most of the time people are talking
about either decision trees or neural networks. Of the two neural networks have probably been
of greater interest through the formative stages of data mining technology. As we will see neural
networks do have disadvantages that can be limiting in their ease of use and ease of deployment,
but they do also have some significant advantages. Foremost among these advantages is their
highly accurate predictive models that can be applied across a large number of different types of
problems.
What is a rule?
In rule induction systems the rule itself is of a simple form of if this and this and this then this. For
example a rule that a supermarket might find in their data collected from scanners would be: if
pickles are purchased then ketchup is purchased. Or
If paper plates then plastic forks
If dip then potato chips
If salsa then tortilla chips
In order for the rules to be useful there are two pieces of information that must be supplied as well
as the actual rule:
Accuracy - How often is the rule correct?
Coverage - How often does the rule apply?
Just because a pattern in the database is expressed as a rule does not mean that it is true all the time. Thus, just like in other data mining algorithms, it is important to recognize and make explicit the uncertainty in the rule. This is what the accuracy of the rule means. The coverage of the rule has to do with how much of the database the rule covers or applies to. Examples of these two measures for a variety of rules are shown in Table 2.2.
In some cases accuracy is called the confidence of the rule and coverage is called the
support. Accuracy and coverage appear to be the preferred ways of naming these two
measurements.
Rule                                                                                     | Accuracy | Coverage
If breakfast cereal purchased then milk purchased.                                       | 85%      | 20%
If bread purchased then swiss cheese purchased.                                          | 15%      | 6%
If 42 years old and purchased pretzels and purchased dry roasted peanuts then beer will be purchased. | 95% | 0.01%
Table 2.2 Examples of Rule Accuracy and Coverage
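
A minimal sketch of computing accuracy (confidence) and coverage (support) for a rule over a set of market baskets; the transactions and the rule are hypothetical.

    def rule_measures(transactions, antecedent, consequent):
        # Coverage: fraction of transactions the rule applies to (antecedent present).
        # Accuracy: of those, the fraction where the consequent also holds.
        applies = [t for t in transactions if antecedent <= t]
        correct = [t for t in applies if consequent <= t]
        coverage = len(applies) / len(transactions)
        accuracy = len(correct) / len(applies) if applies else 0.0
        return accuracy, coverage

    baskets = [{"milk", "cereal"}, {"milk", "bread"}, {"cereal", "milk", "eggs"},
               {"bread", "cheese"}, {"cereal"}]
    print(rule_measures(baskets, {"cereal"}, {"milk"}))   # accuracy 2/3, coverage 3/5
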

SOME IMPORTANT QUESTIONS
Q.1 Discuss in detail the architecture of data warehouse.
Ans : The technical architecture of data warehouses is somewhat similar to that of other systems, but it does have some special characteristics. There are two broad variants of data warehouse architecture - the single-layer architecture and the N-layer architecture. The difference between them is the amount of middleware between the operational systems and the analytical tools. The data warehouse architecture described here is a high-level architecture, and the parts mentioned in it are full-bodied systems rather than system parts.
Components of Data Warehouse Architecture
Source Data Component
1. Production Data
2. Internal Data
3. Archived Data
4. External Data
Data Staging Component
1. Data Extraction
2. Data Transformation
3. Data Loading
Data Storage Component
Information Delivery Component
Metadata Component
Management and Control Component



Data Warehouses can be architected in many different ways, depending on the specific needs of
a business. The model described here is the "hub-and-spoke" Data Warehousing architecture that is popular in many organizations.
In short, data is moved from databases used in operational systems into a data warehouse staging
area, then into a data warehouse and finally into a set of conformed data marts. Data is copied
from one database to another using a technology called ETL (Extract, Transform, Load).
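
A minimal, illustrative ETL pass in Python; sqlite3 in-memory databases stand in for the operational source and the warehouse target, and the table and column names are hypothetical.

    import sqlite3

    source = sqlite3.connect(":memory:")
    target = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE orders (id INTEGER, amount REAL, order_date TEXT)")
    source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                       [(1, 120.0, "2024-01-03"), (2, 80.0, "2024-01-03"), (3, 55.0, "2024-01-04")])
    target.execute("CREATE TABLE daily_sales (order_date TEXT, total REAL)")

    # Extract: read the raw orders from the operational source.
    rows = source.execute("SELECT order_date, amount FROM orders").fetchall()

    # Transform: aggregate order amounts into daily totals.
    totals = {}
    for order_date, amount in rows:
        totals[order_date] = totals.get(order_date, 0.0) + amount

    # Load: write the transformed rows into the warehouse target.
    target.executemany("INSERT INTO daily_sales VALUES (?, ?)", list(totals.items()))
    print(target.execute("SELECT * FROM daily_sales").fetchall())
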










Typical Data Warehousing Environment


THREE-TIER DATA WAREHOUSE ARCHITECTURE :
Generally, data warehouses adopt a three-tier architecture. Following are the three tiers of data warehouse architecture.
Bottom Tier - The bottom tier of the architecture is the data warehouse database server. It is a relational database system. We use back-end tools and utilities to feed data into the bottom tier; these back-end tools and utilities perform the Extract, Clean, Load, and Refresh functions.

Middle Tier - In the middle tier we have the OLAP server. The OLAP server can be implemented in either of the following ways:
o By Relational OLAP (ROLAP), which is an extended relational database management system. ROLAP maps operations on multidimensional data to standard relational operations.
o By the Multidimensional OLAP (MOLAP) model, which directly implements multidimensional data and operations.
Top Tier - This tier is the front-end client layer. It holds the query tools, reporting tools, analysis tools and data mining tools.



Q.2 Write short notes on any four of the following :
(1) Distributed data marts
(2) Data mining techniques
Ans :
(1) Distributed data marts :
The data mart could be at a different location from the data warehouse, so we should ensure that the LAN or WAN has the capacity to handle the data volumes being transferred during the data mart load process.


Since data marts are smaller, they can be placed on smaller distributed machines, allowing users to break away from massively powered central machines and still handle the processing of their reports.




(2) Data mining techniques : Explained above.

Q.3 Write in brief bayesian classifiers.
Ans :


Introduction
Bayesian classification is based on Bayes' theorem. Bayesian classifiers are statistical classifiers. Bayesian classifiers are able to predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
Bayes' Theorem
Bayes' theorem is named after Thomas Bayes. There are two types of probability involved:
Posterior Probability [P(H|X)]
Prior Probability [P(H)]
where X is a data tuple and H is some hypothesis.
According to Bayes' theorem:
P(H|X) = P(X|H) P(H) / P(X)
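
A minimal numeric sketch of Bayes' theorem in Python; the probabilities below are made-up illustrative values, not taken from any data set.

    def posterior(p_x_given_h, p_h, p_x):
        # Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
        return p_x_given_h * p_h / p_x

    # Hypothetical example: H = "customer buys a computer", X = "customer is 35 and earns $40,000".
    p_h = 0.5            # prior probability of buying
    p_x_given_h = 0.2    # likelihood of this profile among buyers
    p_x = 0.25           # overall probability of this profile
    print(posterior(p_x_given_h, p_h, p_x))   # 0.4, the posterior probability of buying
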
Bayesian Belief Network
A Bayesian belief network specifies joint conditional probability distributions.
Bayesian networks and probabilistic networks are also known as belief networks.
A Bayesian belief network allows class conditional independencies to be defined between subsets of variables.
A Bayesian belief network provides a graphical model of causal relationships on which learning can be performed.
We can use a trained Bayesian network for classification. Following are the other names by which Bayesian belief networks are known:
Belief networks
Bayesian networks
Probabilistic networks
There are two components to define Bayesian Belief Network:
Directed acyclic graph
A set of conditional probability tables
Directed Acyclic Graph
Each node in the directed acyclic graph represents a random variable.

These variables may be discrete or continuous valued. They may correspond to actual attributes given in the data.
Directed Acyclic Graph Representation
The following diagram shows a directed acyclic graph for six boolean variables.

The arcs in the diagram allow representation of causal knowledge. For example, lung cancer is influenced by a person's family history of lung cancer, as well as whether or not the person is a smoker. It is worth noting that the variable PositiveXRay is independent of whether the patient has a family history of lung cancer or is a smoker, given that we know the patient has lung cancer.
Set of conditional probability tables: Each variable has a conditional probability table (CPT). The CPT for the variable LungCancer (LC) gives the probability of each value of LC for every possible combination of the values of its parent nodes, FamilyHistory (FH) and Smoker (S).
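
A minimal sketch of such a CPT as a Python lookup table; the probability values below are illustrative only and not taken from real data.

    # Illustrative CPT for P(LungCancer = yes | FamilyHistory, Smoker).
    cpt_lung_cancer = {
        ("yes", "yes"): 0.8,
        ("yes", "no"): 0.5,
        ("no", "yes"): 0.7,
        ("no", "no"): 0.1,
    }

    def p_lung_cancer(family_history, smoker):
        # Look up P(LC = yes) for one combination of the parent values.
        return cpt_lung_cancer[(family_history, smoker)]

    print(p_lung_cancer("yes", "no"))   # 0.5
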


Q.4 What is the role of information visualization in data mining ?
Ans :
Data mining provides many useful results but is difficult to implement. The choice of data mining
technique is not easy and expertise in the domain of interest is required. If one could travel over
the data set of interest, much as a plane flies over a landscape with the occupants identifying
points of interest, the task of data mining would be much simpler. Just as population centers are
noted and isolated communities are identified in a landscape so clusters of data instances and

isolated instances might be identified in a data set. Identification would be natural and understood
by all. At the moment no such general-purpose visualization techniques exist for multi-
dimensional data. The visualization techniques available are either crude or limited to particular
domains of interest. They are used in an exploratory way and require confirmation by other more
formal data mining techniques to be certain about what is revealed. Visualization of data to make
information more accessible has been used for centuries. The work of Tufte provides a
comprehensive review of some of the better approaches and examples from the past. The
interactive nature of computers and the ability of a screen display to change dynamically have led
to the development of new visualization techniques. Researchers in computer graphics are
particularly active in developing new visualizations. These researchers have adopted the term
Visualization to describe representations of various situations in a broad way.
Physical problems such as volume and flow analysis have prompted researchers to develop a
rich set of paradigms for visualization of their application areas [Niel96 p.97]. The term Information
Visualization has been adopted as a more specific description of the visualization of data that are not necessarily representations of physical systems which have their inherent semantics embedded in three dimensional space.
Consider a multi-dimensional data set of US census data on individuals. Each individual
represents an entity instance and each entity instance has a
number of attributes. In a relational database a row of data in a table would be equivalent to an
entity instance. Each column in that row would contain a value equivalent to an attribute value of
that entity instance. The data set is multidimensional; the number of attributes being equal and
equivalent to the number of dimensions. The attributes in the chosen example are occupation,
age, gender, income, marital status, level of education and birthplace. The data is categorical
because the values of the attributes for each instance may only be chosen from certain categories.
For example gender may only take a value from the categories male or female. Information
Visualization is concerned with multi-dimensional data that may be less structured than data sets
grounded in some physical system. Physical systems have inherent semantics in a three
dimensional space. Multi-dimensional data, in contrast, may have some dimensions containing
values that fall into categories instead of being continuous over a range. This is the case for the
data collected in many fields of study.
