
Unit - I

Introduction to Data Warehouse:

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.

Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For
example, "sales" can be a particular subject.

Integrated: A data warehouse integrates data from multiple data sources. For example, source
A and source B may have different ways of identifying a product, but in a data warehouse,
there will be only a single way of identifying a product.

Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older from a data warehouse. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold only the most recent address of a customer, whereas a data warehouse can hold all addresses associated with a customer.

Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a
data warehouse should never be altered.

Multidimensional Data Model


A multidimensional data model stores data in the form of a data cube. Data warehousing tools commonly present these cubes in two or three dimensions for viewing, though a cube may have more dimensions.
A data cube allows data to be viewed in multiple dimensions. Dimensions are the entities with respect to which an organization wants to keep records. For example, in a store's sales records, dimensions allow the store to keep track of things like monthly sales of items and the branches and locations.
A multidimensional database helps to provide data-related answers to complex business queries quickly and accurately.
Data warehouses and Online Analytical Processing (OLAP) tools are based on a
multidimensional data model. OLAP in data warehousing enables users to view data from
different angles and dimensions.
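
A minimal sketch of the data cube idea using pandas; the sales records, dimension names, and figures below are invented for illustration only:

```python
import pandas as pd

# Illustrative sales facts: each row has three dimensions
# (item, branch, month) and one measure (sales_amount).
sales = pd.DataFrame({
    "item":   ["TV", "TV", "Phone", "Phone", "TV", "Phone"],
    "branch": ["B1", "B2", "B1",    "B2",    "B1", "B1"],
    "month":  ["Jan", "Jan", "Jan", "Feb",   "Feb", "Feb"],
    "sales_amount": [400, 250, 300, 320, 410, 290],
})

# A simple "cube": total sales for every (item, branch, month) combination.
cube = sales.groupby(["item", "branch", "month"])["sales_amount"].sum()

# Viewing the same data along different dimensions (a 2-D slice of the cube).
by_item_month = sales.pivot_table(index="item", columns="month",
                                  values="sales_amount", aggfunc="sum")
print(cube)
print(by_item_month)
```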

Schemas for Multidimensional Model


Star Schema
The simplest data warehouse schema is the star schema, so called because its structure resembles a star. A star schema consists of data in the form of facts and dimensions. The fact table sits at the center of the star, and the points of the star are the dimension tables.
In a star schema the fact table contains a large amount of data, with no redundancy. Each dimension table is joined to the fact table through its primary key, which appears as a foreign key in the fact table.

Fact Tables
A fact table has two types of columns: columns of foreign keys (pointing to the dimension tables) and columns of numeric measures.
Dimension Tables
A dimension table is generally small in size compared to a fact table. The primary key of a dimension table is a foreign key in the fact table.
Examples of dimension tables are:
• Time dimension table
• Product dimension table
• Employee dimension table
• Geography dimension table
The main characteristics of a star schema are that it is easy to understand and that queries need only a small number of table joins.
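
A minimal sketch of a star schema and a typical query against it, using pandas with hypothetical dimension and fact tables (all table names, columns, and figures are invented):

```python
import pandas as pd

# Hypothetical dimension tables; each has a primary key (<name>_id).
time_dim = pd.DataFrame({"time_id": [1, 2], "month": ["Jan", "Feb"], "year": [2024, 2024]})
product_dim = pd.DataFrame({"product_id": [10, 11], "product_name": ["TV", "Phone"]})

# Fact table: foreign keys pointing to the dimensions plus numeric measures.
sales_fact = pd.DataFrame({
    "time_id": [1, 1, 2],
    "product_id": [10, 11, 10],
    "units_sold": [5, 8, 3],
    "sales_amount": [2000, 2400, 1200],
})

# A typical star-schema query: join the fact table to its dimensions
# and aggregate a measure by dimension attributes.
report = (sales_fact
          .merge(time_dim, on="time_id")
          .merge(product_dim, on="product_id")
          .groupby(["month", "product_name"])["sales_amount"].sum())
print(report)
```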

Snowflake Schema

The snowflake schema is more complex than the star schema because the dimension tables of the snowflake are normalized.
The snowflake schema is represented by a centralized fact table that is connected to multiple dimension tables, and these dimension tables can in turn be normalized into additional dimension tables.

The major difference between the snowflake and star schema models is that the dimension tables of the snowflake model are normalized to reduce redundancy.
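
A small sketch of what normalizing a dimension looks like, using a hypothetical product dimension (pandas, with invented data):

```python
import pandas as pd

# In a star schema the product dimension might be one denormalized table:
product_star = pd.DataFrame({
    "product_id": [10, 11],
    "product_name": ["TV", "Phone"],
    "category_name": ["Electronics", "Electronics"],  # repeated for every product
})

# In a snowflake schema the same dimension is normalized: the category
# attributes move to their own table, removing the repetition.
category_dim = pd.DataFrame({"category_id": [1], "category_name": ["Electronics"]})
product_dim = pd.DataFrame({"product_id": [10, 11],
                            "product_name": ["TV", "Phone"],
                            "category_id": [1, 1]})

# Queries now need an extra join to recover the denormalized view.
denormalized = product_dim.merge(category_dim, on="category_id")
print(denormalized)
```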

Fact Constellation Schema

A fact constellation can have multiple fact tables that share many dimension tables. This type
of schema can be viewed as a collection of stars or snowflakes, and hence is called a galaxy schema or a fact constellation.
The main disadvantage of fact constellation schemas is their more complicated design.

Data Warehouse Architecture:


Tier-1:
The bottom tier is a warehouse database server that is almost always a relational
database system. Back-end tools and utilities are used to feed data into the bottom tier
from operational databases or other external sources (such as customer profile
information provided by external consultants). These tools and utilities perform data
extraction, cleaning, and transformation (e.g., to merge similar data from different
sources into a unified format), as well as load and refresh functions to update the data
warehouse. The data are extracted using application program interfaces known as
gateways. A gateway is supported by the underlying DBMS and allows client
programs to generate SQL code to be executed at a server.

Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity).
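
As an illustration of how a client program can use such a gateway, the sketch below uses the pyodbc package (assuming it is installed and that an ODBC data source named warehouse_dsn exists); the DSN, credentials, and table name are placeholders, not part of these notes:

```python
import pyodbc  # ODBC gateway for Python client programs

# Hypothetical connection details; replace with a real DSN and credentials.
conn = pyodbc.connect("DSN=warehouse_dsn;UID=etl_user;PWD=secret")
cursor = conn.cursor()

# The gateway lets the client generate SQL code that is executed at the server.
cursor.execute(
    "SELECT product_id, SUM(sales_amount) FROM sales_fact GROUP BY product_id")
for product_id, total in cursor.fetchall():
    print(product_id, total)

conn.close()
```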
• Data extraction - get data from multiple, heterogeneous, and external sources
• Data cleaning - detect errors in the data and rectify them when possible
• Data transformation - convert data from legacy or host format to warehouse format
• Load - sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
• Refresh - propagate the updates from the data sources to the warehouse

This tier also contains a metadata repository, which stores information about the data
warehouse and its contents.
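
A highly simplified sketch of the extract-clean-transform-load cycle listed above; the file name, field names, and cleaning rules are invented for illustration only:

```python
import csv

def extract(path):
    # Extract: read raw records from an operational source (a CSV file here).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def clean(records):
    # Clean: drop records with missing amounts and normalize text fields.
    return [{**r, "product": r["product"].strip().upper()}
            for r in records if r.get("amount")]

def transform(records):
    # Transform: convert the source format to the warehouse format (typed fields).
    return [{"product": r["product"], "amount": float(r["amount"])}
            for r in records]

def load(rows):
    # Load: in a real system this would insert into the warehouse and refresh
    # summaries and indices; here we just aggregate and print the result.
    totals = {}
    for row in rows:
        totals[row["product"]] = totals.get(row["product"], 0.0) + row["amount"]
    print(totals)

# Usage, assuming a file "sales_source.csv" with product,amount columns:
# load(transform(clean(extract("sales_source.csv"))))
```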

Tier-2:

The middle tier is an OLAP server that is typically implemented using either a relational OLAP (ROLAP) model or a multidimensional OLAP (MOLAP) model.
A ROLAP model is an extended relational DBMS that maps operations on multidimensional data to standard relational operations.
A MOLAP model is a special-purpose server that directly implements multidimensional data and operations.
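
As a rough illustration of how a ROLAP server maps multidimensional operations onto relational ones, the sketch below (using pandas, with invented data) expresses a roll-up and a slice as ordinary relational aggregation and selection:

```python
import pandas as pd

# Detailed facts at the (product, city, month) level (invented data).
facts = pd.DataFrame({
    "product": ["TV", "TV", "Phone", "Phone"],
    "city":    ["Delhi", "Mumbai", "Delhi", "Mumbai"],
    "month":   ["Jan", "Jan", "Jan", "Feb"],
    "sales":   [400, 250, 300, 320],
})

# A ROLAP server maps the multidimensional roll-up "city -> all cities"
# onto an ordinary relational aggregation (GROUP BY product, month).
rollup = facts.groupby(["product", "month"])["sales"].sum()

# A slice operation (fix month = "Jan") maps onto a relational selection.
jan_slice = facts[facts["month"] == "Jan"]
print(rollup)
print(jan_slice)
```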

Tier-3:
The top tier is a front-end client layer, which contains query and reporting tools,
analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
Data Warehouse Implementation

There are various steps in implementing a data warehouse, which are as follows:
1. Requirements analysis and capacity planning: The first process in data warehousing involves defining enterprise needs, defining the architecture, carrying out capacity planning, and selecting the hardware and software tools. This step involves consulting senior management as well as the other stakeholders.

2. Hardware integration: Once the hardware and software have been selected, they need to be put together by integrating the servers, the storage methods, and the user software tools.

3. Modeling: Modeling is a significant stage that involves designing the warehouse schema and views. This may involve using a modeling tool if the data warehouse is sophisticated.

4. Physical modeling: For the data warehouse to perform efficiently, physical modeling is needed. This involves designing the physical data warehouse organization, data placement, data partitioning, deciding on access methods, and indexing.

5. Sources: The information for the data warehouse is likely to come from several data sources. This step involves identifying and connecting the sources using a gateway, ODBC drivers, or another wrapper.

6. ETL: The data from the source systems will need to go through an ETL phase. The process of designing and implementing the ETL phase may involve selecting a suitable ETL tool vendor and purchasing and implementing the tools. This may include customizing the tool to suit the needs of the enterprise.

7. Populate the data warehouse: Once the ETL tools have been agreed upon, testing the tools will be needed, perhaps using a staging area. Once everything is working adequately, the ETL tools may be used to populate the warehouse given the schema and view definitions.

8. User applications: For the data warehouse to be helpful, there must be end-user applications. This step involves designing and implementing the applications required by the end users.
9. Roll-out the warehouse and applications: Once the data warehouse has been populated and the end-client applications tested, the warehouse system and the applications may be rolled out for the user community to use.

Integration of a Data Mining System with a Database or Data Warehouse System

The possible integration schemes are as follows.
No coupling:
The data mining system does not utilize any function of a database or data warehouse system. It may fetch data from a particular source (such as a file system), process the data using some data mining algorithms, and then store the mining results in another file.
Loose coupling:
The data mining system uses some facilities of a database or data warehouse system, fetching data from a data repository managed by these systems, performing data mining, and then storing the mining results either in a file or in a designated place in a database or data warehouse.
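
A minimal sketch of loose coupling, using Python's built-in sqlite3 module as the database system; the tables, the "high spender" rule, and all values are invented for illustration:

```python
import sqlite3

# Set up a small in-memory database standing in for the data repository.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO purchases VALUES (?, ?)",
                 [("A", 120.0), ("A", 80.0), ("B", 1500.0), ("C", 60.0)])

# 1. Fetch data from the repository managed by the database system.
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM purchases GROUP BY customer").fetchall()

# 2. Perform the mining step outside the DBMS (here, a trivial rule).
high_spenders = [(c, total) for c, total in rows if total > 1000]

# 3. Store the mining results back in a designated place in the database.
conn.execute("CREATE TABLE mining_results (customer TEXT, total REAL)")
conn.executemany("INSERT INTO mining_results VALUES (?, ?)", high_spenders)
print(conn.execute("SELECT * FROM mining_results").fetchall())
```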
Semitight coupling:
Besides linking the data mining system to a database/data warehouse system, efficient implementations of a few essential data mining primitives are provided in the database/data warehouse system.
These primitives can include sorting, indexing, aggregation, histogram analysis, multiway join, and precomputation of some essential statistical measures, such as sum, count, max, min, standard deviation, and so on.
Tight coupling:
The data mining system is smoothly integrated into the database/data warehouse system. The data mining subsystem is treated as one functional component of an information system.
Data Mining
It is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories.

Functionalities/Techniques:
• Concept/Class Description: Characterization and Discrimination
• Mining Frequent Patterns, Associations and Correlations
• Classification and Prediction
• Cluster Analysis
• Outlier Analysis
• Evolution Analysis
Data Characterization: A data mining system should be able to produce a description summarizing the characteristics of a class of customers.
Example: the characteristics of customers who spend more than $1000 a year at a store called AllElectronics. The result can be a general profile covering attributes such as age, employment status, or credit rating.
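
A minimal characterization sketch using pandas; the customer table, column names, and the $1000 threshold follow the example above, but all values are invented:

```python
import pandas as pd

# Invented customer data for illustration only.
customers = pd.DataFrame({
    "age": [25, 42, 37, 51, 29],
    "employment": ["student", "employed", "employed", "employed", "unemployed"],
    "credit_rating": ["fair", "excellent", "good", "excellent", "fair"],
    "annual_spend": [300, 2500, 1800, 4000, 150],
})

# Characterization: summarize the customers who spend more than $1000 a year.
big_spenders = customers[customers["annual_spend"] > 1000]
print(big_spenders["age"].describe())                # typical age profile
print(big_spenders["employment"].value_counts())     # employment status profile
print(big_spenders["credit_rating"].value_counts())  # credit rating profile
```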
Data Discrimination: It is a comparison of the general features of the target class data objects with the general features of objects from one or a set of contrasting classes. The user can specify the target and contrasting classes.
Example: The user may like to compare the general features of software products whose sales increased by 10% in the last year with those whose sales decreased by about 30% in the same period.
Frequent Patterns: as the name suggests, patterns that occur frequently in data.
Association Analysis: from a marketing perspective, determining which items are frequently purchased together within the same transaction.
Example: the following rule is mined from the AllElectronics transactional database.
buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]
• X represents a customer.
• Confidence = 50%: if a customer buys a computer, there is a 50% chance that he/she will buy software as well.
• Support = 1%: 1% of all the transactions under analysis showed that computer and software were purchased together.
Another example:
age(X, “20…29”) ∧ income(X, “20K…29K”) ⇒ buys(X, “CD player”) [support = 2%, confidence = 60%]
This rule refers to customers between 20 and 29 years of age with an income of $20,000-$29,000. There is a 60% chance that they will purchase a CD player, and 2% of all the transactions under analysis showed that customers in this age group with that range of income bought a CD player.
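
The support and confidence figures above can be computed directly from transaction data. A minimal sketch with an invented toy set of transactions:

```python
# Toy transaction data; items and counts are invented for illustration.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"software", "printer"},
    {"computer", "software", "printer"},
]

def support(itemset):
    # Fraction of all transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Of the transactions containing the antecedent, the fraction that
    # also contain the consequent.
    return support(antecedent | consequent) / support(antecedent)

# The rule buys(computer) => buys(software) on this toy data:
print("support    =", support({"computer", "software"}))      # 2/4 = 0.5
print("confidence =", confidence({"computer"}, {"software"}))  # 2/3 ≈ 0.67
```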
Classification: Classification is the process of finding a model that describes and distinguishes
data classes or concepts for the purpose of being able to use the model to predict the class of
objects whose class label is unknown.
A classification model can be represented in various forms, such as:
• IF-THEN rules
• A decision tree
• A neural network
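
A minimal, hypothetical sketch of the IF-THEN rule form of a classification model; the attributes, thresholds, and class labels below are invented for illustration:

```python
def classify(customer):
    # A hand-written IF-THEN rule model (rules and thresholds are invented).
    if customer["age"] < 30 and customer["student"]:
        return "buys_computer = yes"
    if customer["income"] == "high":
        return "buys_computer = yes"
    return "buys_computer = no"

# Predicting the class of objects whose class label is unknown:
print(classify({"age": 24, "student": True,  "income": "low"}))
print(classify({"age": 45, "student": False, "income": "medium"}))
```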
Cluster Analysis
Clustering analyzes data objects without consulting a known class label.
Example: Cluster analysis can be performed on AllElectronics customer data in order to identify homogeneous subpopulations of customers. These clusters may represent individual target groups for marketing. For instance, a 2-D plot of customers with respect to customer locations in a city can reveal such clusters.
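
A minimal clustering sketch, assuming scikit-learn is available; the customer locations are invented, and k-means is used only as one example of a clustering algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans  # assumes scikit-learn is installed

# Invented 2-D customer locations in a city (x, y coordinates).
locations = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

# Cluster the customers into homogeneous subpopulations without using
# any class labels; each cluster can serve as a marketing target group.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(locations)
print(labels)                   # cluster assignment for each customer
print(kmeans.cluster_centers_)  # the centre of each cluster
```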

Outlier Analysis : A database may contain data objects that do not comply with the general
behavior or model of the data. These data objects are outliers.
Example: detecting fraudulent usage of credit cards. Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of extremely large amounts for a given account number in comparison to the regular charges incurred by the same account. Outlier values may also be detected with respect to the location and type of purchase, or the purchase frequency.
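
A minimal sketch of this idea, flagging charges that deviate strongly from an account's regular charges; the amounts and the two-standard-deviation threshold are invented for illustration:

```python
import statistics

# Invented charges for one account; the last amount is suspiciously large.
charges = [45.0, 60.0, 52.0, 48.0, 55.0, 50.0, 2400.0]

mean = statistics.mean(charges)
stdev = statistics.stdev(charges)

# Flag purchases that deviate strongly from the account's regular charges.
outliers = [x for x in charges if abs(x - mean) > 2 * stdev]
print(outliers)  # [2400.0]
```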

Evolution Analysis: Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time.
Example: time-series data. Suppose the stock market data (a time series) of the last several years is available from the New York Stock Exchange and one would like to invest in shares of high-tech industrial companies. A data mining study of the stock exchange data may identify stock evolution regularities for overall stocks and for the stocks of particular companies. Such regularities may help predict future trends in stock market prices, contributing to one's decision making regarding stock investments.

Data Mining Task Primitives

We can specify a data mining task in the form of a data mining query. This query is input to the system. A data mining query is defined in terms of data mining task primitives.
Here is the list of data mining task primitives:
• Set of task-relevant data to be mined
• Kind of knowledge to be mined
• Background knowledge to be used in the discovery process
• Interestingness measures and thresholds for pattern evaluation
• Representation for visualizing the discovered patterns

Set of task-relevant data to be mined
This is the portion of the database in which the user is interested. This portion includes the following:
• Database attributes
• Data warehouse dimensions of interest
Kind of knowledge to be mined
It refers to the kind of functions to be performed. These functions are:
• Characterization
• Discrimination
• Association and Correlation Analysis
• Classification
• Prediction
• Clustering
• Outlier Analysis
• Evolution Analysis
Background knowledge

Background knowledge allows data to be mined at multiple levels of abstraction. Concept hierarchies, for example, are one form of background knowledge that allows data to be mined at multiple levels of abstraction.
Interestingness measures and thresholds for pattern evaluation
These are used to evaluate the patterns that are discovered by the knowledge discovery process. There are different interestingness measures for different kinds of knowledge.
Representation for visualizing the discovered patterns
This refers to the form in which discovered patterns are to be displayed. These representations may include the following:
• Rules
• Tables
• Charts
• Graphs
• Decision Trees
• Cubes

Major Issues in Data Mining


Mining different kinds of knowledge in databases. - The needs of different users are not the same, and different users may be interested in different kinds of knowledge. Therefore it is necessary for data mining to cover a broad range of knowledge discovery tasks.

Interactive mining of knowledge at multiple levels of abstraction. - The data mining process needs to be interactive because it allows users to focus the search for patterns, providing and refining data mining requests based on returned results.

Incorporation of background knowledge. - Background knowledge can be used to guide the discovery process and to express the discovered patterns. It may be used to express the discovered patterns not only in concise terms but at multiple levels of abstraction.

Data mining query languages and ad hoc data mining. - A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.

Presentation and visualization of data mining results. - Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable by the users.

Handling noisy or incomplete data. - Data cleaning methods are required that can handle noise and incomplete objects while mining the data regularities. Without such data cleaning methods, the accuracy of the discovered patterns will be poor.

Pattern evaluation. - This refers to the interestingness of the discovered patterns. Many of the patterns discovered may be uninteresting to the user, either because they represent common knowledge or because they lack novelty, so effective interestingness measures are needed.
