Reduction, Discretization and Concept Hierarchy Generation, Data mining primitives, Types of Data
Mining, Architectures of data mining systems. Data Characterization: Data generation & Summarization
based characterization, Analytical characterization, Mining class comparisons.
Mining Association Rules in large databases: Association Rule mining, Single dimensional Boolean
association rules from Transactional DBS, Multi level association rules from transaction DBS,
Multidimensional association rules from relational DBS and DWS, Correlation analysis, Constraint based
association mining.
DATA CLEANING
NOISE
Noise is a random error or variance in a measured variable. Noisy data may be due to faulty data
collection instruments, data entry problems, and technology limitations.
Binning:
Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values around it. The
sorted values are distributed into a number of "buckets," or bins.
For example, suppose the sorted data for price (in dollars) are: 4, 8, 15, 21, 21, 24, 25, 28, 34.
The data are first sorted and then partitioned into equal-frequency bins of size 3:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
In smoothing by bin boundaries, each value is replaced by the closest bin boundary (the minimum or maximum of the bin):
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
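A minimal Python sketch of smoothing by bin means and by bin boundaries, assuming equal-frequency bins of size 3 and the price values above; the function names are illustrative, not from any particular library.

```python
import numpy as np

def smooth_by_bin_means(values, bin_size=3):
    """Sort the values, split them into equal-frequency bins,
    and replace every value with the mean of its bin."""
    sorted_vals = np.sort(np.asarray(values, dtype=float))
    bins = [sorted_vals[i:i + bin_size] for i in range(0, len(sorted_vals), bin_size)]
    return [np.full(len(b), b.mean()) for b in bins]

def smooth_by_bin_boundaries(values, bin_size=3):
    """Replace every value with the closest bin boundary (min or max of its bin)."""
    sorted_vals = np.sort(np.asarray(values, dtype=float))
    bins = [sorted_vals[i:i + bin_size] for i in range(0, len(sorted_vals), bin_size)]
    smoothed = []
    for b in bins:
        lo, hi = b.min(), b.max()
        smoothed.append(np.where(b - lo <= hi - b, lo, hi))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices))       # [9 9 9], [22 22 22], [29 29 29]
print(smooth_by_bin_boundaries(prices))  # [4 4 15], [21 21 24], [25 25 34]
```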
Regression
Data can be smoothed by fitting the data to a regression function. Linear regression involves finding the
"best" line to fit two attributes (or variables) so that one attribute can be used to predict the other.
Multiple linear regression is an extension of linear regression, where more than two
attributes are involved and the data are fit to a multidimensional surface.
Clustering:
Outliers may be detected by clustering, where similar values are organized into groups, or "clusters."
Values that fall outside of the set of clusters may be considered outliers.
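As a hedged illustration of detecting outliers via clustering, the sketch below uses scikit-learn's KMeans to group one-dimensional values and flags points that end up in very small clusters; the number of clusters and the toy values are arbitrary choices for the example, not a prescribed rule.

```python
import numpy as np
from sklearn.cluster import KMeans

values = np.array([4, 5, 5, 6, 20, 21, 22, 95], dtype=float).reshape(-1, 1)

# Group similar values into clusters (k chosen by hand for this toy example).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(values)
labels = kmeans.labels_

# Flag values that end up in very small clusters as candidate outliers.
sizes = np.bincount(labels)
outliers = values.ravel()[sizes[labels] == 1]
print(outliers)  # 95 is expected to be isolated from the other clusters
```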
DATA INTEGRATION
Data integration is a data preprocessing technique that combines data from multiple sources and
provides users with a unified view of these data. These sources may include multiple databases, data cubes,
or flat files. One of the most well-known implementations of data integration is building an enterprise's data
warehouse, which enables a business to perform analyses based on the data it contains.
There are mainly two major approaches to data integration:
1. Tight Coupling
In tight coupling, data is combined from different sources into a single physical location through the
process of ETL - Extraction, Transformation and Loading.
2. Loose Coupling
In loose coupling, the data remains only in the actual source databases. In this approach, an interface is
provided that takes a query from the user, transforms it into a form the source databases can understand, and
then sends the query directly to the source databases to obtain the result.
Entity identification problem: e.g., one table has customer_id while another database table has cust_number as the attribute for the same entity.
Solution:
1) Refer to the metadata of each attribute, which includes the name, meaning, data type, and range of
permitted values.
2) Check that any attribute functional dependencies and referential constraints in the source system match those in
the target system.
Redundancy:
1) An attribute is redundant if it can be "derived" from another attribute or set of attributes.
2) Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set.
3) Some redundancies can be detected by correlation analysis.
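A short sketch of redundancy detection via correlation analysis, here using Pearson's correlation coefficient for two numeric attributes; the attribute names and values are made up for illustration.

```python
import numpy as np

# Two numeric attributes from an integrated data set (illustrative values).
annual_income = np.array([30, 45, 60, 75, 90], dtype=float)       # in $1000s
monthly_spend = np.array([2.1, 3.0, 4.2, 5.1, 6.0], dtype=float)  # in $1000s

# Pearson correlation coefficient: values near +1 or -1 suggest that one
# attribute can largely be derived from the other, i.e. it may be redundant.
r = np.corrcoef(annual_income, monthly_spend)[0, 1]
print(f"correlation = {r:.3f}")
if abs(r) > 0.9:  # the threshold is a judgment call, not a fixed rule
    print("attributes are highly correlated; one may be redundant")
```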
TUPLE DUPLICATION
In addition to detecting redundancies between attributes, duplication should also be detected at the tuple level,
where two or more identical tuples exist for a unique data entry case.
DATA REDUCTION
2. Dimensionality reduction:
When we come across data that is only weakly important, we keep only the attributes required
for our analysis. Dimensionality reduction reduces data size by eliminating outdated or redundant features.
1. Wavelet Transform
In the wavelet transform, a data vector X is transformed into a numerically different data vector X' such
that both X and X' are of the same length. How, then, is it useful in reducing data?
The data obtained from the wavelet transform can be truncated: a compressed approximation is obtained
by storing only a small fraction of the strongest wavelet coefficients.
Wavelet transforms can be applied to data cubes, sparse data, or skewed data.
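A hedged sketch of wavelet-based reduction, assuming the PyWavelets (pywt) package is available: the data vector is decomposed, all but the strongest coefficients are zeroed out (so only a small fraction needs to be stored), and an approximation is reconstructed.

```python
import numpy as np
import pywt  # PyWavelets, assumed to be installed

x = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])

# Haar wavelet decomposition of the data vector.
coeffs = pywt.wavedec(x, 'haar')
flat = np.concatenate(coeffs)

# Keep only the strongest coefficients (e.g. the top 50%), zeroing the rest.
threshold = np.quantile(np.abs(flat), 0.5)
truncated = [np.where(np.abs(c) >= threshold, c, 0.0) for c in coeffs]

# Reconstruct an approximation of the original vector from the kept coefficients.
x_approx = pywt.waverec(truncated, 'haar')
print(x_approx)
```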
Attribute subset selection (stepwise forward selection), for example: suppose there are the following attributes
in the data set, a few of which are redundant.
Initial attribute set: {X1, X2, X3, X4, X5, X6}
P-values: {0.7, 0.6, 0.1, 0.1, 0.5, 0.2}
Initial reduced attribute set: { }
Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}
Final reduced attribute set: {X1, X2, X5}
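The stepwise selection above can be sketched as a greedy loop in Python. The scores below simply mirror the values listed in the example and are treated as relevance scores (higher = more relevant); the rule of stopping after three attributes is an assumption made only to reproduce the steps shown.

```python
# A minimal greedy forward-selection sketch (scores and stopping rule are
# assumptions for illustration, mirroring the example above).
scores = {"X1": 0.7, "X2": 0.6, "X3": 0.1, "X4": 0.1, "X5": 0.5, "X6": 0.2}

selected = []
remaining = set(scores)
while remaining and len(selected) < 3:
    # At each step, add the most relevant attribute not yet selected.
    best = max(remaining, key=scores.get)
    selected.append(best)
    remaining.remove(best)
    print(f"Step-{len(selected)}: {selected}")
# Step-1: ['X1'], Step-2: ['X1', 'X2'], Step-3: ['X1', 'X2', 'X5']
```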
3. Data Compression:
The data compression technique reduces the size of files using different encoding mechanisms (e.g., Huffman
encoding and run-length encoding). We can divide it into two types based on the compression technique used.
Lossless Compression –
Encoding techniques such as run-length encoding allow a simple and modest reduction in data size. Lossless data compression
uses algorithms that restore the precise original data from the compressed data (a minimal run-length encoding sketch appears after this section).
Lossy Compression –
Methods such as the discrete wavelet transform and PCA (principal component analysis) are examples of this type of
compression. For example, the JPEG image format uses lossy compression, but we can still recover an image whose meaning is
equivalent to the original. In lossy data compression, the decompressed data may differ from the original data but are still
useful enough to retrieve information from.
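A minimal sketch of lossless run-length encoding, mentioned in the lossless compression item above; it stores each symbol together with the length of its run and can restore the exact original sequence.

```python
from itertools import groupby

def rle_encode(data: str) -> list[tuple[str, int]]:
    """Encode a string as (symbol, run_length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(data)]

def rle_decode(pairs: list[tuple[str, int]]) -> str:
    """Restore the exact original string from the pairs (lossless)."""
    return "".join(ch * count for ch, count in pairs)

encoded = rle_encode("AAAABBBCCD")
print(encoded)              # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
print(rle_decode(encoded))  # 'AAAABBBCCD'
```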
4. Numerosity Reduction:
In this reduction technique the actual data are replaced with a mathematical model or a smaller representation of the data.
For parametric methods, only the model parameters need to be stored instead of the actual data; non-parametric methods
include clustering, histograms, and sampling.
Parametric methods
Regression and log-linear models can be used to approximate the given data. In (simple)
linear regression, the data are modeled to fit a straight line. For example, a random
variable, y (called a response variable), can be modeled as a linear function of another
random variable, x (called a predictor variable), with the equation
y = wx + b
where the variance of y is assumed to be constant. In the context of data mining, x and y
are numeric database attributes. The coefficients, w and b (called regression coefficients),
specify the slope of the line and the y-intercept, respectively.
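A short sketch of parametric numerosity reduction with simple linear regression: instead of storing all (x, y) pairs, only the two regression coefficients w and b are kept. The data values are illustrative, and numpy.polyfit is used only as one convenient way to fit the line.

```python
import numpy as np

# Numeric attributes x (predictor) and y (response); values are illustrative.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# Fit y = w*x + b; only w and b need to be stored instead of the raw data.
w, b = np.polyfit(x, y, deg=1)
print(f"y = {w:.2f} * x + {b:.2f}")

# Any y value can later be approximated from its x value.
y_estimate = w * 3.5 + b
print(y_estimate)
```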
Non-parametric methods
clustering
Clustering techniques consider data tuples as objects. They partition the objects into
groups, or clusters, so that objects within a cluster are “similar” to one another and “dissimilar”
to objects in other clusters.
histogram
Histograms use binning to approximate data distributions and are a popular form
of data reduction. A histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, referred
to as buckets or bins.
If each bucket represents only a single attribute–value/frequency pair, the
buckets are called singleton buckets
example:
Histograms. The following data are a list of AllElectronics prices for commonly sold
items (rounded to the nearest dollar). The numbers have been sorted: 1, 1, 5, 5, 5,
5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18,
18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30,
30, 30.
Singleton buckets are useful for storing high-frequency outliers.
Equal-width: In an equal-width histogram, the width of each bucket range is
uniform (e.g., a width of $10 for each bucket; see the sketch after these definitions).
Equal-frequency (or equal-depth): In an equal-frequency histogram, the buckets are
created so that, roughly, the frequency of each bucket is constant (i.e., each bucket
contains roughly the same number of contiguous data samples)
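A sketch of building an equal-width histogram for the AllElectronics price list above with numpy; the bucket width of $10 matches the equal-width description, and the bucket edges are chosen by hand for this example.

```python
import numpy as np

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15,
          15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20,
          20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

# Equal-width buckets of width $10: 1-10, 11-20, 21-30.
counts, edges = np.histogram(prices, bins=[1, 11, 21, 31])
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo}, {hi}): {c} items")
```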
Sampling
Sampling can be used for data reduction because it allows a large data set to be represented by a much smaller random
data sample (or subset).
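A tiny sketch of simple random sampling without replacement using numpy, reducing a large attribute to a much smaller representative subset; the data and sampling fraction are arbitrary for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
data = np.arange(10_000)  # stand-in for a large attribute column

# Simple random sample without replacement (SRSWOR) of 1% of the tuples.
sample = rng.choice(data, size=len(data) // 100, replace=False)
print(len(sample), sample[:5])
```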
Difference between Dimensionality Reduction and Numerosity Reduction:
Dimensionality reduction: some data can be lost, but only data which is irrelevant. Numerosity reduction: there is less (or no) loss of data.
Dimensionality reduction: its components are feature selection and feature extraction. Numerosity reduction: it has no components, but uses methods such as regression, histograms, clustering, and sampling.
Dimensionality reduction: it leads to less misleading data and more accurate models. Numerosity reduction: it preserves the integrity of the data, and the data volume is also reduced.
4. Normalization, where the attribute data are scaled so as to fall within a smaller range,
such as -1.0 to 1.0, or 0.0 to 1.0.
5. Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by
interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior).
The labels, in turn, can be recursively organized into higher-level concepts, resulting
in a concept hierarchy for the numeric attribute. Figure 3.12 shows a concept hierarchy
for the attribute price. More than one concept hierarchy can be defined for the same
attribute to accommodate the needs of various users.
6. Concept hierarchy generation for nominal data, where attributes such as street can
be generalized to higher-level concepts, like city or country.
Top-down discretization
If the process starts by first finding one or a few points (called split points
or cut points) to split the entire attribute range, and then repeats this
recursively on the resulting intervals, then it is called top-down
discretization or splitting.
Bottom-up discretization
If the process starts by considering all of the continuous values as
potential split-points and removes some by merging neighboring values to
form intervals, then it is called bottom-up discretization or merging.
The measurement unit used can affect the data analysis. For example, changing measurement
units from meters to inches for height, or from kilograms to pounds for weight, may lead to very different results.
To help avoid dependence on the choice of measurement units, the
data should be normalized or standardized. This involves transforming the data to fall
within a smaller or common range such as [-1, 1] or [0.0, 1.0]. (The terms standardize
and normalize are used interchangeably in data preprocessing.)
Normalization
Normalization is particularly useful for classification algorithms involving neural networks or
distance measurements such as nearest-neighbor classification and clustering. If using
the neural network backpropagation algorithm for classification mining, normalizing the input
values for each attribute measured in the training tuples will help speed up the learning phase.
There are many methods for data normalization, including min-max normalization,
z-score normalization, and normalization by decimal scaling. In the following, let A be
a numeric attribute with n observed values, v1, v2, ..., vn.
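A minimal sketch of min-max and z-score normalization for a numeric attribute A with observed values v1..vn, matching the description above; the sample values and the target range [0, 1] for min-max are just examples.

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])  # observed values of attribute A

# Min-max normalization to [new_min, new_max] = [0.0, 1.0]:
#   v' = (v - min_A) / (max_A - min_A) * (new_max - new_min) + new_min
v_minmax = (v - v.min()) / (v.max() - v.min())

# Z-score normalization:
#   v' = (v - mean_A) / std_A
v_zscore = (v - v.mean()) / v.std()

print(v_minmax)
print(v_zscore)
```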
Data mining is defined as a process used to extract usable data from a larger set of raw data. It implies
analysing data patterns in large batches of data using one or more software tools. Data mining is also known as
Knowledge Discovery in Data (KDD).
Data mining is the process of finding anomalies, patterns, and correlations within large data sets to predict
outcomes. Using a broad range of techniques, you can use this information to increase revenues, cut costs,
improve customer relationships, and reduce risks.
The significant components of data mining systems are a data source, data mining engine, data warehouse server, the
pattern evaluation module, graphical user interface, and knowledge base.
Data Source:
The actual sources of data are databases, data warehouses, the World Wide Web (WWW), text files, and other documents. You
need a huge amount of historical data for data mining to be successful. Organizations typically store data in databases or
data warehouses. Data warehouses may comprise one or more databases, text files, spreadsheets, or other repositories of
data. Sometimes, even plain text files or spreadsheets may contain information. Another primary source of data is the
World Wide Web or the internet.
Before passing the data to the database or data warehouse server, the data must be cleaned, integrated, and selected. As
the information comes from various sources and in different formats, it can't be used directly for the data mining
procedure because the data may not be complete and accurate. So, the data first needs to be cleaned and unified. More
information than needed will be collected from various data sources, and only the data of interest has to be selected
and passed to the server. These procedures are not as easy as they sound; several methods may be performed on the data
as part of selection, integration, and cleaning.
Database or Data Warehouse Server:
The database or data warehouse server contains the original data that is ready to be processed. Hence, the server is
responsible for retrieving the relevant data, based on the user's data mining request.
Data Mining Engine:
The data mining engine is a major component of any data mining system. It contains several modules for performing data
mining tasks, including association, characterization, classification, clustering, prediction, and time-series analysis.
In other words, the data mining engine is the core of the data mining architecture. It comprises the instruments and software
used to obtain insights and knowledge from data collected from various data sources and stored within the data
warehouse.
Pattern Evaluation Module:
The pattern evaluation module is primarily responsible for measuring how interesting a discovered pattern is, using a threshold
value. It collaborates with the data mining engine to focus the search on interesting patterns.
This segment commonly employs interestingness measures that cooperate with the data mining modules to focus the search
towards interesting patterns. It might utilize an interestingness threshold to filter out discovered patterns. On the other hand, the
pattern evaluation module might be integrated with the mining module, depending on the implementation of the data
mining techniques used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as
deep as possible into the mining procedure so as to confine the search to only interesting patterns.
Graphical User Interface:
The graphical user interface (GUI) module communicates between the data mining system and the user. This module helps
the user use the system easily and efficiently without knowing the complexity of the process. It cooperates
with the data mining system when the user specifies a query or a task and displays the results.
Knowledge Base:
The knowledge base is helpful in the entire data mining process. It may be used to guide the search or to evaluate the
interestingness of the resulting patterns. The knowledge base may even contain user views and data from user experiences
that can be helpful in the data mining process. The data mining engine may receive inputs from the knowledge base to make
the results more accurate and reliable. The pattern evaluation module interacts with the knowledge base regularly to get
inputs and to update it.
DATA MINING AS A KDD PROCESS
KDD stands for Knowledge Discovery in Databases.
Data Characterization
For example, one may want to characterize the OurVideoStore customers who
regularly rent more than 30 movies a year. With concept hierarchies on the
attributes describing the target class, the attribute-oriented induction method can be
used, for example, to carry out data summarization. Note that with a data cube
containing a summarization of the data, simple OLAP operations fit the purpose of data
characterization.
Analytical characterization
Analytical characterization is used to help identify weakly relevant or irrelevant
attributes. We can exclude these unwanted, irrelevant attributes when preparing our data for
mining.
Data Collection
The data is collected for the target class and its contrasting class.
Preliminary relevance analysis with the help of conservative AOI
We need to decide on a set of dimensions and attributes and apply the selected relevance
measure to them. The relation obtained by such an application of Attribute-Oriented Induction
is called the candidate relation of the mining task.
For example, an initial attribute set {A, B, C, D, E, F, G, H} might be reduced to {A, B, E, F, H} after relevance analysis.
Relevance analysis to remove the irrelevant or weakly relevant attributes
This step consists of applying relevance analysis to remove the weakly relevant or irrelevant attributes.
Attribute Oriented Induction to generate the concepts
We then perform Attribute-Oriented Induction (AOI), an algorithm for data summarization; AOI can,
however, suffer from the problem of over-generalization. Data summarization is a data mining technique
with the help of which we can summarize large data sets into concise, understandable knowledge.
Relevance Measures
We can determine the classifying power of an attribute within a set of data with the help of a
quantitative relevance measure.
Some competing relevance measures are mentioned below:
Gini index
χ² contingency table statistics
Gain ratio (C4.5)
Uncertainty coefficient
Information gain (ID3)
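As a concrete illustration of one of these measures, the sketch below computes information gain (as used by ID3) for a single categorical attribute against a class label; the toy attribute and class values are invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Entropy of the whole set minus the weighted entropy after splitting
    on the attribute; a higher gain means a more relevant attribute."""
    total = len(labels)
    split_entropy = 0.0
    for value in set(attribute_values):
        subset = [l for a, l in zip(attribute_values, labels) if a == value]
        split_entropy += len(subset) / total * entropy(subset)
    return entropy(labels) - split_entropy

# Toy data: does the attribute "status" help predict the target class?
status = ["grad", "grad", "undergrad", "undergrad", "grad", "undergrad"]
target = ["yes", "yes", "no", "no", "yes", "yes"]
print(information_gain(status, target))
```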
From the data analysis point of view, we can classify data mining into the following two
categories: predictive data mining and descriptive data mining.
Concept Description
Concept description is the simplest kind of descriptive data mining. A concept is a term that
refers to a collection of data, for example new_students, graduate_students, alumni, and so on.
Concept description is not a simple enumeration of the data; the reason is that concept
description generates descriptions for the characterization and comparison of the data.
The term concept description is also referred to as class description, especially when the
concept to be described refers to a class of objects.
Comparison of data
• Comparison of data provides descriptions comparing two or more collections of data.
Characterization of data
Characterization provides a concise summary of the given collection of data.
Cross-Tabulation:
Mapping results into cross-tabulation form (similar to
contingency tables).
Visualization Techniques:
Pie charts, bar charts, curves, cubes, and other visual
forms.
Quantitative characteristic rules:
Mapping generalized results into characteristic rules
with quantitative information associated with them.
Data Cube Approach
It is nothing but performing computations and storing
results in data cubes.
Strength
An efficient implementation of data generalization.
Computation of various kinds of measures, e.g.,
count( ), sum( ), average( ), max( ).
Generalization and specialization can be performed
on a data cube by roll-up and drill-down.
Limitations
It handles only dimensions of simple non-numeric
data and measures of simple aggregated numeric values.
Lack of intelligent analysis: it cannot tell which
dimensions should be used or to what level the
generalization should reach.
Association Rule Mining
A subsequence,
such as buying first a PC, then a digital camera, and then a memory card, if it occurs frequently
in a shopping history database, is a (frequent) sequential pattern. If a substructure occurs frequently, it is
called a (frequent) structured pattern. Finding frequent patterns plays an essential role in
mining associations, correlations, and many other interesting relationships among data.
Frequent pattern mining searches for recurring relationships in a given data set.
Each item has a Boolean variable representing the presence or absence of that item. Each basket can then
be represented by a Boolean vector of values assigned to these variables. The Boolean
vectors can be analyzed for buying patterns that reflect items that are frequently associated
or purchased together. These patterns can be represented in the form of association rules.
For example, the information that customers who purchase computers also tend
to buy antivirus software at the same time is represented in the following association
rule:
computer ⇒ antivirus_software [support = 2%, confidence = 60%]
Rule support and confidence are two measures of rule interestingness. They respectively
reflect the usefulness and certainty of discovered rules. A support of 2% for
the rule above means that 2% of all the transactions under analysis show that computer
and antivirus software are purchased together.
A confidence of 60% means that 60% of
the customers who purchased a computer also bought the software. Association
rules are considered interesting if they satisfy both a minimum support threshold
and a minimum confidence threshold.
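A short sketch of computing support and confidence for a single rule over a toy transaction database; the transactions and the rule computer ⇒ antivirus_software are illustrative, so the resulting percentages will differ from the figures quoted above.

```python
transactions = [
    {"computer", "antivirus_software", "printer"},
    {"computer", "memory_card"},
    {"computer", "antivirus_software"},
    {"printer", "camera"},
    {"computer", "antivirus_software", "camera"},
]

antecedent = {"computer"}
consequent = {"antivirus_software"}

n = len(transactions)
both = sum(1 for t in transactions if antecedent | consequent <= t)
antecedent_only = sum(1 for t in transactions if antecedent <= t)

support = both / n                   # P(A and B)
confidence = both / antecedent_only  # P(B | A)
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```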
Example (Apriori algorithm, with minimum support count = 2):
Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset (this is candidate set C1).
(II) Compare each candidate item's support count with the minimum support count (here
min_support = 2); if the support count of a candidate item is less than min_support,
remove that item. This gives us itemset L1.
Step-2: K=2
(I) Generate candidate set C2 using L1 (this is called the join step). The condition for
joining Lk-1 and Lk-1 is that they should have (K-2) elements in common.
Check whether all subsets of an itemset are frequent or not, and if not frequent,
remove that itemset. (For example, the subsets of {I1, I2} are {I1} and {I2}; they are
frequent. Check this for each itemset.)
Now find the support count of these itemsets by searching the dataset.
(II) Compare the candidate set (C2) support count with the minimum support count (here
min_support = 2); if the support count of a candidate item is less than min_support,
remove those items. This gives us itemset L2.
For example, when later joining 2-itemsets that share (K-2) = 1 element:
{I1,I2} join {I1,I3} = {I1,I2,I3}, a frequent itemset
{I1,I2} join {I1,I5} = {I1,I2,I5}, a frequent itemset
{I1,I2} join {I2,I3} = {I1,I2,I3}
{I1,I2} join {I2,I4} = {I1,I2,I4}; its subset {I1,I4} is infrequent, which is why {I1,I2,I4} is removed
Candidates such as {I2,I3,I4} and {I2,I4,I5} are likewise removed because they contain infrequent subsets.
Step-3:
(I) Generate candidate set C3 using L2 (join step). The condition for joining Lk-1
and Lk-1 is that they should have (K-2) elements in common; so here, for L2, the
first element should match.
(II) Compare the candidate (C3) support count with the minimum support count (here
min_support = 2); if the support count of a candidate item is less than min_support,
remove those items. This gives us itemset L3.
Step-4:
Generate candidate set C4 using L3 (join step). The condition for joining Lk-1
and Lk-1 (K=4) is that they should have (K-2) elements in common. The resulting
candidate {I1,I2,I3,I5} is not frequent (it has an infrequent subset), so C4 is
empty and the algorithm stops.
Thus, we have discovered all the frequent itemsets. Now the generation of strong
association rules comes into the picture. For that we need to calculate the confidence of
each rule.
Confidence –
A confidence of 60% means that 60% of the customers who purchased milk and
bread also bought butter.
Confidence(A->B)=Support_count(A∪B)/Support_count(A)
So here, taking one frequent itemset as an example, we will show the rule generation.
Itemset {I1, I2, I3} // from L3
The rules can be:
[I1^I2] => [I3] // confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4*100 = 50%
[I1^I3] => [I2] // confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4*100 = 50%
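A hedged sketch of this rule-generation step: given a frequent itemset and the support counts of its subsets, enumerate candidate rules and keep those meeting a minimum confidence. Only the counts for {I1,I2,I3}, {I1,I2}, and {I1,I3} are taken from the worked example above; the remaining counts are assumed placeholders and should come from the actual dataset.

```python
from itertools import combinations

# Support counts: values marked "assumed" are not given in the text above
# and are placeholders for illustration only.
support_count = {
    frozenset({"I1", "I2", "I3"}): 2,  # from the example
    frozenset({"I1", "I2"}): 4,        # from the example
    frozenset({"I1", "I3"}): 4,        # from the example
    frozenset({"I2", "I3"}): 4,        # assumed
    frozenset({"I1"}): 6,              # assumed
    frozenset({"I2"}): 7,              # assumed
    frozenset({"I3"}): 6,              # assumed
}

def rules_from_itemset(itemset, min_confidence=0.5):
    """Enumerate A => (itemset - A) for every non-empty proper subset A and
    keep rules whose confidence = sup(itemset) / sup(A) meets the threshold."""
    itemset = frozenset(itemset)
    strong = []
    for r in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), r):
            a = frozenset(antecedent)
            conf = support_count[itemset] / support_count[a]
            if conf >= min_confidence:
                strong.append((set(a), set(itemset - a), conf))
    return strong

for a, b, conf in rules_from_itemset({"I1", "I2", "I3"}, min_confidence=0.5):
    print(f"{sorted(a)} => {sorted(b)}  confidence = {conf:.0%}")
```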
Association rules generated from mining data at multiple levels of abstraction are called multiple-level or multilevel
association rules.
Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework.
Rules at a high concept level may amount to common-sense knowledge, while rules at a low concept level may not always be useful.
o Using uniform minimum support for all levels:
When a uniform minimum support threshold is used, the search procedure is simplified.
The method is also simple, in that users are required to specify only one minimum support threshold.
The same minimum support threshold is used when mining at each level of abstraction.
For example, a minimum support threshold of 5% may be used throughout
(e.g., for mining from "computer" down to "laptop computer").
Both “computer” and “laptop computer” are found to be frequent, while “desktop computer” is not.
Using reduced minimum support at lower levels:
o Each level of abstraction has its own minimum support threshold.
o The deeper the level of abstraction, the smaller the corresponding threshold is.
o For example, the minimum support thresholds for levels 1 and 2 might be 5% and 3%, respectively.
o In this way, “computer,” “laptop computer,” and “desktop computer” are all considered frequent.
[Figure omitted: a concept hierarchy with "all" at the top and items such as laptop and desktop computers, office and anti-virus software, printer, and camera at the lower levels.]
The minimum support threshold at the highest level of abstraction is higher than the minimum support thresholds
at the lower levels. Association rules can be extracted only when the concept hierarchy is combined with the
support framework. If a single, larger support count were used, only items at the highest level would qualify
for rule formation.
Mining multidimensional association rules, that is, rules involving more than one dimension or
predicate (e.g., rules relating what a customer buys as well as the customer's age). These methods can be
organized according to their treatment of quantitative attributes.
Multi-Dimensional Association
• Single-dimensional rules, e.g.:
buys(X, "milk") ⇒ buys(X, "bread")
• Multi-dimensional rules involve two or more dimensions or predicates, e.g.:
age(X, "20..29") ∧ occupation(X, "student") ⇒ buys(X, "laptop")
• Search for frequent k-predicate sets. Example: {age, occupation, buys} is a 3-predicate set.
• Techniques can be categorized by how quantitative attributes, such as age, are treated.
3. Distance‐based association rules – This is a dynamic discretization process that considers the
distance between data points.
[Figure omitted: the lattice of predicate sets over the dimensions age, income, and buys, from the empty set through {age}, {income}, {buys} and the pairwise sets up to {age, income, buys}.]
2. Using dynamic discretization of quantitative attributes: