
DATA WAREHOUSE

What is a Data Warehouse?


 Relational database that is designed for query and analysis rather than
transaction processing.
 Any centralized data repository which can be queried for business benefits.
 It is a database that stores information oriented to satisfy decision-making
requests.
 Group of data specific to the entire organization, not only to a particular
group of users.
 Data warehouse environment contains an extraction, transformation, and
loading (ETL) solution, an online analytical processing (OLAP) engine,
customer analysis tools, and other applications that handle the process of
gathering information and delivering it to business users.
 Supports architectures and tools for business executives to systematically
organize, understand, and use their information to make strategic decisions.

A Data Warehouse can be viewed as a data system with the following attributes:

o It is a database designed for investigative tasks, using data from various applications.
o It supports a relatively small number of clients with relatively long interactions.
o It includes current and historical data to provide a historical perspective of
information.
o Its usage is read-intensive.
o It contains a few large tables.
Characteristics of Data Warehouse

Subject-Oriented
Provide a concise and straightforward view around a particular subject, such as
customer, product, or sales, instead of the global organization's ongoing operations.

Integrated
A data warehouse integrates various heterogeneous data sources like RDBMS, flat
files, and online transaction records.

Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve files
from 3 months, 6 months, 12 months, or even previous data from a data warehouse.

Non-Volatile
Non-volatile means that once data has entered the warehouse, it should not
change.
Goals of Data Warehousing
o To help reporting as well as analysis
o Maintain the organization's historical information
o Be the foundation for decision making.

Need for Data Warehouse


Data Warehouse is needed for the following reasons:

1. Business User: Business users require a data warehouse to view summarized
data from the past. Since these people are non-technical, the data may be
presented to them in an elementary form.
2. Store historical data: Data Warehouse is required to store the time variable
data from the past. This input is made to be used for various purposes.
3. Make strategic decisions: Some strategies may depend upon the data
in the data warehouse. So, the data warehouse contributes to making strategic
decisions.
4. For data consistency and quality: By bringing data from different sources to
a common place, the user can effectively ensure uniformity and consistency
in the data.
5. High response time: The data warehouse has to be ready for somewhat
unexpected loads and types of queries, which demands a significant degree of
flexibility and quick response time.

Benefits of Data Warehouse

1. Understand business trends and make better forecasting decisions.
2. Data warehouses are designed to perform well with enormous amounts of data.
3. The structure of data warehouses is more accessible for end-users to navigate,
understand, and query.
4. Queries that would be complex in many normalized databases can be easier
to build and maintain in data warehouses.
5. Data warehousing provides the capability to analyse a large amount of
historical data.
COMPONENTS OR BUILDING BLOCKS OF DATA WAREHOUSE

Source Data Component


Production Data: This type of data comes from the different operational systems of
the enterprise.

Internal Data: In each organization, the client keeps their "private" spreadsheets,
reports, customer profiles, and sometimes even department databases.

Archived Data: In every operational system, we periodically take the old data and
store it in archived files.

External Data: Most executives depend on information from external sources for a
large percentage of the information they use.
Data Staging Component
After data has been extracted from various operational systems and external
sources, we have to prepare the files for storing in the data warehouse. The extracted
data coming from several different sources needs to be changed, converted, and
made ready in a format that is suitable to be saved for querying and analysis.

We will now discuss the three primary functions that take place in
the staging area.

1) Data Extraction: This method has to deal with numerous data sources. We have
to employ the appropriate techniques for each data source.

2) Data Transformation: If data extraction for a data warehouse poses big
challenges, data transformation presents even more significant challenges. We perform
several individual tasks as part of data transformation.

First, we clean the data extracted from each source. Cleaning may be the correction
of misspellings or may deal with providing default values for missing data
elements, or elimination of duplicates when we bring in the same data from
various source systems.

Standardization of data components forms a large part of data transformation. We
combine data from a single source record or related data parts from many source
records.
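The cleaning tasks just described (correcting misspellings, supplying default values for missing elements, and eliminating duplicates) can be sketched in a few lines. This is a minimal illustration, not a production ETL routine; the field names, spelling table, and default values are all hypothetical:

```python
# Hypothetical lookup of known misspellings found in source systems.
SPELLING_FIXES = {"Nwe York": "New York"}

def clean_record(record, defaults):
    """Correct known misspellings and fill in defaults for missing fields."""
    cleaned = dict(record)
    for field, value in cleaned.items():
        if value in SPELLING_FIXES:
            cleaned[field] = SPELLING_FIXES[value]
    for field, default in defaults.items():
        if cleaned.get(field) is None:
            cleaned[field] = default
    return cleaned

def deduplicate(records, key):
    """Keep the first record seen per key (same data from several sources)."""
    seen, result = set(), []
    for r in records:
        if r[key] not in seen:
            seen.add(r[key])
            result.append(r)
    return result

raw = [
    {"customer_id": 1, "city": "Nwe York", "segment": None},
    {"customer_id": 1, "city": "New York", "segment": "retail"},  # duplicate
    {"customer_id": 2, "city": "Boston", "segment": "corporate"},
]
cleaned = [clean_record(r, {"segment": "unknown"}) for r in raw]
unique = deduplicate(cleaned, "customer_id")
```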

3) Data Loading: Two distinct categories of tasks form the data loading function. When
we complete the structure and construction of the data warehouse and go live for
the first time, we perform the initial load, which moves high volumes of data and uses
up a substantial amount of time.

Data Storage Components

Data storage for the data warehouse is a separate repository. The data repositories for
the operational systems generally include only the current data. Also, these data
repositories hold the data structured in highly normalized form for fast and efficient
processing.
Information Delivery Component
The information delivery element is used to enable the process of subscribing to
data warehouse files and having them transferred to one or more destinations according
to some customer-specified scheduling algorithm.

Metadata Component
Metadata in a data warehouse is analogous to the data dictionary or the data catalogue in
a database management system. In the data dictionary, we keep the data about the
logical data structures, the data about the records and addresses, the information
about the indexes, and so on.

Data Marts
A data mart includes a subset of corporate-wide data that is of value to a specific group
of users. Data marts are smaller than data warehouses and usually contain data for a
single part of the organization. The current trend in data warehousing is to develop a
data warehouse with several smaller related data marts for particular kinds of queries
and reports.

Management and Control Component


The management and control elements coordinate the services and functions within
the data warehouse. These components control the data transformation and the
data transfer into the data warehouse storage. In addition:

 It moderates the data delivery to the clients.

 It works with the database management systems and authorizes data to be
correctly saved in the repositories.

 It monitors the movement of information into the staging area and from
there into the data warehouse storage itself.

Difference between Database and Data Warehouse

Database:
1. Used for Online Transaction Processing (OLTP).
2. Data is dynamic.
3. Entity-relationship modelling techniques are used for RDBMS database design.
4. Optimized for write operations.
5. Performance is low for analysis queries.
6. The data is taken as a base and managed to provide fast and efficient access.

Data Warehouse:
1. Used for Online Analytical Processing (OLAP).
2. Data is largely static.
3. Data modelling approaches are used for data warehouse design.
4. Optimized for read operations.
5. High performance for analytical queries.
6. The application data is handled for analysis and reporting objectives.

The operational database is the source of information for the data warehouse. It
includes detailed information used to run the day-to-day operations of the business.

Data Warehouse Architecture


A data warehouse architecture is a method of defining the overall architecture of
data communication, processing, and presentation that exists for end-client
computing within the enterprise.

Three common architectures are:

o Data Warehouse Architecture: Basic
o Data Warehouse Architecture: With Staging Area
o Data Warehouse Architecture: With Staging Area and Data Marts

Data Warehouse Architecture: Basic


Operational System: Refers to a system that is used to process the day-to-day
transactions of an organization.

Flat Files: A system of files in which transactional data is stored, and every file in the
system must have a different name.

Meta Data: A set of data that defines and gives information about other data.

Metadata summarizes necessary information about data, which can make finding
and working with particular instances of data more accessible. For example, author,
date created, date modified, and file size are examples of very basic document
metadata.

Lightly and highly summarized data: The goal of summarized information is
to speed up query performance. The summarized record is updated continuously as
new information is loaded into the warehouse.
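As a sketch of how summarized data speeds up queries, the following uses an in-memory SQLite database with hypothetical sales_detail and sales_summary tables; a real warehouse would refresh such summaries as part of its load process:

```python
import sqlite3

# "Lightly summarized" data: detailed sales rows are rolled up by product so
# frequent queries can hit the small summary table instead of the detail table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_detail (product TEXT, qty INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales_detail VALUES (?, ?, ?)",
    [("widget", 2, 20.0), ("widget", 1, 10.0), ("gadget", 5, 75.0)],
)

# Rebuild the summary whenever new detail rows are loaded into the warehouse.
conn.execute("DROP TABLE IF EXISTS sales_summary")
conn.execute(
    """CREATE TABLE sales_summary AS
       SELECT product, SUM(qty) AS total_qty, SUM(amount) AS total_amount
       FROM sales_detail GROUP BY product"""
)
rows = {p: (q, a) for p, q, a in conn.execute("SELECT * FROM sales_summary")}
```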

Data Warehouse Architecture: With Staging Area


Staging area (a place where data is processed before entering the warehouse).
The data warehouse staging area is a temporary location where records from source
systems are copied.

A staging area simplifies data cleansing and consolidation for operational data
coming from multiple source systems.

Data Warehouse Architecture: With Staging Area and Data Marts
We may want to customize our warehouse's architecture for multiple groups within
our organization.

We can do this by adding data marts. A data mart is a segment of a data
warehouse that can provide information for reporting and analysis on a section,
unit, department or operation in the company, e.g., sales, payroll, production, etc.
Properties of Data Warehouse Architectures
1. Separation: Analytical and transactional processing should be kept apart as much
as possible.

2. Scalability: Hardware and software architectures should be simple to upgrade as
the data volume, which has to be managed and processed, and the number of users'
requirements, which have to be met, progressively increase.

3. Extensibility: The architecture should be able to accommodate new operations and
technologies without redesigning the whole system.

4. Security: Monitoring accesses is necessary because of the strategic data stored
in the data warehouse.

5. Administrability: Data Warehouse management should not be complicated.

Types of Data Warehouse Architectures


Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple source
systems), the reconciled layer, and the data warehouse layer (containing both data
warehouses and data marts). The reconciled layer sits between the source data and
the data warehouse.

The main advantage of the reconciled layer is that it creates a standard reference
data model for a whole enterprise. At the same time, it separates the problems of
source data extraction and integration from those of data warehouse population.

Data Warehouses usually have a three-level (tier) architecture that includes:

1. Bottom Tier (Data Warehouse Server)
2. Middle Tier (OLAP Server)
3. Top Tier (Front-end Tools)

A bottom tier that consists of the data warehouse server, which is almost always
an RDBMS. It may include several specialized data marts and a metadata repository.

A middle tier which consists of an OLAP server for fast querying of the data
warehouse.

A top tier that contains front-end tools for displaying results provided by OLAP, as
well as additional tools for data mining of the OLAP-generated data.

Principles of Data Warehousing

Load Performance

Data warehouses require incremental loading of new data on a periodic basis within
narrow time windows; performance on the load process should be measured in
hundreds of millions of rows and gigabytes per hour and must not artificially
constrain the volume of data the business requires.

Load Processing
Many steps must be taken to load new or updated data into the data warehouse,
including data conversion, filtering, reformatting, indexing, and metadata update.

Data Quality Management

Fact-based management demands the highest data quality. The warehouse ensures
local consistency, global consistency, and referential integrity despite "dirty" sources
and massive database size.

Query Performance

Fact-based management must not be slowed by the performance of the data
warehouse RDBMS; large, complex queries must complete in seconds, not days.

Terabyte Scalability

Data warehouse sizes are growing at astonishing rates. Today these range from a few
gigabytes to hundreds of gigabytes, and even to terabyte-sized data warehouses.

ETL (Extract, Transform, and Load) Process
What is ETL?
The mechanism of extracting information from source systems and bringing it into
the data warehouse is commonly called ETL, which stands for Extraction,
Transformation and Loading.

 ETL stands for Extract, Transform, and Load, while ELT stands for Extract,
Load, and Transform.
 In ETL, data flows from the data source to staging to the data destination.
 ELT lets the data destination do the transformation, eliminating the need for
data staging.
 ETL can help with data privacy and compliance by cleansing sensitive data
before loading it into the data destination, while ELT is simpler and suited to
companies with minor data needs.
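The ETL flow above can be sketched end to end: extract rows from a source, transform them in a staging step, then load them into the destination. The function names and record layout are illustrative only, not a real ETL framework:

```python
def extract(source_rows):
    """Extraction: pull raw records from a source system."""
    return list(source_rows)

def transform(rows):
    """Transformation: standardize names and convert amounts before loading."""
    return [
        {"name": r["name"].strip().title(), "amount": round(float(r["amount"]), 2)}
        for r in rows
    ]

def load(rows, destination):
    """Loading: append transformed rows to the destination store."""
    destination.extend(rows)
    return destination

warehouse = []
source = [{"name": " alice ", "amount": "19.991"}, {"name": "BOB", "amount": "5"}]
load(transform(extract(source)), warehouse)
```

In an ELT variant, `transform` would instead run inside the destination database after loading.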

Difference between ETL vs. ELT


Process: In ETL, data is transferred to the ETL server and moved back to the DB;
high network bandwidth is required. In ELT, data remains in the DB except for
cross-database loads (e.g., source to target).

Transformation: In ETL, transformations are performed in the ETL server. In ELT,
transformations are performed in the source or in the target.

Code Usage: ETL is typically used for source-to-target transfer, compute-intensive
transformations, and small amounts of data. ELT is typically used for high amounts
of data.

Time/Maintenance: ETL needs high maintenance, as you need to select data to load
and transform. ELT needs low maintenance, as data is always available.

Calculations: ETL overwrites an existing column or needs to append the dataset and
push it to the target platform. ELT easily adds the calculated column to the existing
table.

What is Data Mining?

 Data mining is one of the most useful techniques that helps entrepreneurs,
researchers, and individuals to extract valuable information from huge sets of data.

 Also called Knowledge Discovery in Databases (KDD), whose steps are: data
cleaning, data integration, data selection, data transformation, data mining,
pattern evaluation, and knowledge presentation.

 The process of extracting information to identify patterns, trends, and
useful data that would allow the business to take data-driven decisions
from huge sets of data is called data mining.

 Data mining utilizes complex mathematical algorithms to segment data
and evaluate the probability of future events.

 Data mining is similar to data science; it is carried out by a person, in a specific
situation, on a particular data set, with an objective.

Data Mining Applications


Data Mining is primarily used by organizations with intense consumer demands-
Retail, determine price, consumer preferences, product positioning, and impact on
sales, customer satisfaction, and corporate profits. Data mining enables a retailer
to use point-of-sale records of customer purchases to develop products and
promotions that help the organization to attract the customer.

These are the following areas where data mining is widely used:
Data Mining in Healthcare:

Data mining in healthcare has excellent potential to improve the health system. It
uses data and analytics for better insights and to identify best practices that will
enhance health care services and reduce costs. 

Data Mining in Market Basket Analysis:

This technique may enable the retailer to understand the purchase behavior of a
buyer. This data may assist the retailer in understanding the requirements of the
buyer and altering the store's layout accordingly.
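A minimal sketch of market basket analysis: computing support ("how often do these items appear together?") and confidence ("given the buyer took bread, how likely is milk?") for item sets. The transactions below are invented for illustration:

```python
# Invented point-of-sale transactions, each a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    return support(antecedent | consequent) / support(antecedent)

bread_milk = confidence({"bread"}, {"milk"})
```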

Data mining in Education:

An organization can use data mining to make precise decisions and also to predict
students' results. With the results, the institution can concentrate on what to
teach and how to teach.

Data Mining in Manufacturing Engineering:

Data mining tools can be beneficial to find patterns in a complex manufacturing
process. Data mining can be used in system-level designing to obtain the
relationships between product architecture, product portfolio, and data needs of the
customers.

Data Mining in CRM (Customer Relationship Management):

To build a good relationship with the customer, a business organization needs to
collect data and analyse the data. With data mining technologies, the collected data
can be used for analytics.

Data Mining in Fraud detection:

A model is constructed using the data, and the technique is made to identify whether
the document is fraudulent or not.

Data Mining in Lie Detection:

Apprehending a criminal is not a big deal, but bringing out the truth from him is a
very challenging task. Law enforcement may use data mining techniques to
investigate offenses, monitor suspected terrorist communications, etc.

Data Mining Financial Banking:

The digitalization of the banking system is supposed to generate an enormous
amount of data with every new transaction. The data mining technique can help the
manager find the data for better targeting, acquiring, retaining, segmenting, and
maintaining profitable customers.
Challenges of Implementation in Data mining
Although data mining is very powerful, it faces many challenges during its execution.
Various challenges could be related to performance, data, methods, and techniques,
etc. The process of data mining becomes effective when the challenges or problems
are correctly recognized and adequately resolved.
Incomplete and noisy data:

The process of extracting useful data from large volumes of data is data mining. The
data in the real world is heterogeneous, incomplete, and noisy. Data in huge
quantities will usually be inaccurate or unreliable. These problems may occur due to
faulty data-measuring instruments or because of human errors. All these
consequences (noisy and incomplete data) make data mining challenging.

Data Distribution:

Real-world data is usually stored on various platforms in a distributed computing
environment. It is quite a tough task to bring all the data to a centralized data
repository, mainly due to organizational and technical concerns.

Complex Data:

Real-world data is heterogeneous, and it could be multimedia data, including audio
and video, images, complex data, spatial data, time series, and so on. Managing
these various types of data and extracting useful information is a tough task.

Performance:

The data mining system's performance relies primarily on the efficiency of algorithms
and techniques used. If the designed algorithm and techniques are not up to the
mark, then the efficiency of the data mining process will be affected adversely.

Data Privacy and Security:

Data mining usually leads to serious issues in terms of data security, governance, and
privacy. For example, if a retailer analyses the details of the purchased items, then it
reveals data about buying habits and preferences of the customers without their
permission.

Data Visualization:

It is the primary method that shows the output to the user in a presentable way. The
extracted data should convey the exact meaning of what it intends to express. But
many times, representing the information to the end-user in a precise and easy way
is difficult. Very efficient and successful data visualization processes need to be
implemented.
Data Mining Techniques
In recent data mining projects, various major data mining techniques have been
developed and used, including association, classification, clustering, prediction,
sequential patterns, and regression.
Data Mining Architecture
The significant components of data mining systems are a data source, data mining
engine, data warehouse server, the pattern evaluation module, graphical user
interface, and knowledge base.
KDD- Knowledge Discovery in Databases
The term KDD stands for Knowledge Discovery in Databases. It refers to the broad
procedure of discovering knowledge in data and emphasizes the high-level
applications of specific Data Mining techniques. It is a field of interest to researchers
in various fields, including artificial intelligence, machine learning, pattern
recognition, databases, statistics, knowledge acquisition for expert systems, and data
visualization.

The main objective of the KDD process is to extract information from data in the
context of large databases. It does this by using Data Mining algorithms to identify
what is deemed knowledge.

The KDD Process


The knowledge discovery process (illustrated in the given figure) is iterative and
interactive, and comprises nine steps. Thus, it is necessary to understand the process
and the different requirements and possibilities in each stage.

The process begins with determining the KDD objectives and ends with the
implementation of the discovered knowledge.
Learning

 Supervised Learning: relies on labelled input data to learn a function that
produces an appropriate output when given new unlabelled data. Supervised
machine learning algorithms are used to solve classification or regression
problems.

 Unsupervised Learning: makes use of input data without any labels; in
other words, there is no teacher (label) telling the child (computer) when it is
right or when it has made a mistake so that it can self-correct. It tries to learn
the basic structure of the data to give us more insight into the data.

Classification

 A classification problem has a discrete value as its output. For example, “likes
pineapple on pizza” and “does not like pineapple on pizza” are discrete. There is no
middle ground.

 It is standard practice to represent the output (label) of a classification
algorithm as an integer number such as 1, -1, or 0.
Regression

 A regression problem has a real number (a number with a decimal point) as its
output.
 We have an independent variable (or set of independent variables) and a
dependent variable (the thing we are trying to guess given our independent
variables).
 Also, each row is typically called an example, observation, or data point, while
each column (not including the label/dependent variable) is often called
a predictor, dimension, independent variable, or feature.
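A regression problem of the kind described above can be illustrated with ordinary least squares on one independent variable. The data points below are invented and roughly follow y = 2x:

```python
# One independent variable (feature) and a real-valued dependent variable.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.1, 5.9, 8.1]

def least_squares(xs, ys):
    """Return (slope, intercept) minimizing the squared prediction error."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx

slope, intercept = least_squares(xs, ys)
```

The fitted line predicts a real number for any new x, unlike a classifier, whose output is one of a fixed set of labels.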

Decision Tree
 A decision tree is a structure that includes a root node, branches, and leaf
nodes.
 Each internal node denotes a test on an attribute, each branch denotes the
outcome of a test, and each leaf node holds a class label.
 The topmost node in the tree is the root node.
 Can be applied to classification and prediction.
 Generate rules: simple interpretation & understanding.
 The benefits of having a decision tree are as follows:
 It does not require any domain knowledge.
 It is easy to comprehend.
 The learning and classification steps of a decision tree are simple and
fast.
Terminology

 Information gain: a measure of how much information the answer to a
specific question provides.
 Entropy: measure of uncertainty in the information.
 If information gain increases, entropy decreases.
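The two quantities above can be computed directly. The tiny dataset below is invented; because the attribute splits the labels perfectly, the information gain equals the full entropy of the labels:

```python
from math import log2

def entropy(labels):
    """Uncertainty in a list of class labels, in bits."""
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def information_gain(rows, attr, label):
    """Entropy reduction from splitting the rows on one attribute."""
    base = entropy([r[label] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[label] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

rows = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "no"},
    {"outlook": "rain", "play": "yes"},
    {"outlook": "rain", "play": "yes"},
]
gain = information_gain(rows, "outlook", "play")
```

A decision tree learner picks the attribute with the highest information gain as the test at each internal node.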
Q: What are the major issues in Data Mining?
A: Some of the data mining challenges are given as under:
 Security and Social Challenges.
 Noisy and Incomplete Data.
 Distributed Data.
 Complex Data.
 Performance.
 Scalability and Efficiency of the Algorithms.
 Improvement of Mining Algorithms.
 Incorporation of Background Knowledge.

What is Metadata?
Metadata is simply defined as data about data. The data that is used to represent
other data is known as metadata. For example, the index of a book serves as
metadata for the contents of the book. In other words, we can say that metadata is
the summarized data that leads us to detailed data. In terms of a data warehouse, we
can define metadata as follows.
 Metadata is the road-map to a data warehouse.
 Metadata in a data warehouse defines the warehouse objects.
 Metadata acts as a directory. This directory helps the decision support
system to locate the contents of a data warehouse.
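As a small illustration of metadata acting as a directory, a database's own catalogue can be queried for the structure of its objects. Here SQLite describes a hypothetical warehouse table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, qty INTEGER, amount REAL)")

# PRAGMA table_info returns one metadata row per column:
# (cid, name, type, notnull, default_value, pk)
columns = {row[1]: row[2] for row in conn.execute("PRAGMA table_info(sales)")}
```

A decision support system can use such catalogue queries to locate the warehouse objects it needs without inspecting the data itself.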

To ensure high-quality data, it's crucial to pre-process it. To make the process easier, data
pre-processing is divided into four stages: data cleaning, data integration, data reduction,
and data transformation.
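The four stages can be sketched on a made-up list of records; the sources, fields, mean-imputation, and min-max scaling choices are illustrative only:

```python
source_a = [{"id": 1, "age": 20, "note": "x"}, {"id": 2, "age": None, "note": "y"}]
source_b = [{"id": 3, "age": 40, "note": "z"}]

# Integration: combine records from both sources.
data = source_a + source_b

# Cleaning: replace missing ages with the mean of the known ages.
known = [r["age"] for r in data if r["age"] is not None]
mean_age = sum(known) / len(known)
for r in data:
    if r["age"] is None:
        r["age"] = mean_age

# Reduction: keep only the attributes needed for analysis.
data = [{"id": r["id"], "age": r["age"]} for r in data]

# Transformation: min-max normalize age to the range [0, 1].
lo, hi = min(r["age"] for r in data), max(r["age"] for r in data)
for r in data:
    r["age"] = (r["age"] - lo) / (hi - lo)
```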
