Reduction, Discretization and Concept Hierarchy Generation, Data mining primitives, Types of Data
Mining, Architectures of data mining systems. Data Characterization: Data generation & Summarization
based characterization, Analytical characterization, Mining class comparisons.
Mining Association Rules in large databases: Association Rule mining, Single dimensional Boolean
association rules from Transactional DBS, Multi level association rules from transaction DBS,
Multidimensional association rules from relational DBS and DWS, Correlation analysis, Constraint based
association mining.
DATA CLEANING
NOISE
Noise is a random error or variance in a measured variable. Noisy data may be due to faulty data
collection instruments, data entry problems, and technology limitations.
Binning:
Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values around it. The
sorted values are distributed into a number of "buckets," or bins.
For example, suppose the sorted data for price (in dollars) are: 4, 8, 15, 21, 21, 24, 25, 28, 34.
The data are first sorted and then partitioned into equal-frequency bins of size 3:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
In smoothing by bin boundaries, each value is replaced by the closest bin boundary (the minimum or maximum of the bin):
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
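A minimal Python sketch of smoothing by bin means and by bin boundaries, assuming equal-frequency bins of size 3 and the price values above; the function names are illustrative, not from any particular library.

```python
import numpy as np

def smooth_by_bin_means(values, bin_size=3):
    """Sort the values, split them into equal-frequency bins,
    and replace every value with the mean of its bin."""
    sorted_vals = np.sort(np.asarray(values, dtype=float))
    bins = [sorted_vals[i:i + bin_size] for i in range(0, len(sorted_vals), bin_size)]
    return [np.full(len(b), b.mean()) for b in bins]

def smooth_by_bin_boundaries(values, bin_size=3):
    """Replace every value with the closest bin boundary (min or max of its bin)."""
    sorted_vals = np.sort(np.asarray(values, dtype=float))
    bins = [sorted_vals[i:i + bin_size] for i in range(0, len(sorted_vals), bin_size)]
    smoothed = []
    for b in bins:
        lo, hi = b.min(), b.max()
        smoothed.append(np.where(b - lo <= hi - b, lo, hi))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices))       # [9 9 9], [22 22 22], [29 29 29]
print(smooth_by_bin_boundaries(prices))  # [4 4 15], [21 21 24], [25 25 34]
```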
Regression
Data can be smoothed by fitting the data to a regression function. Linear regression involves finding the
"best" line to fit two attributes (or variables) so that one attribute can be used to predict the other.
Multiple linear regression is an extension of linear regression, where more than two
attributes are involved and the data are fit to a multidimensional surface.
Clustering:
Outliers may be detected by clustering, where similar values are organized into groups, or "clusters."
Values that fall outside of the set of clusters may be considered outliers.
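As a hedged illustration of detecting outliers via clustering, the sketch below uses scikit-learn's KMeans to group one-dimensional values and flags points that end up in very small clusters; the number of clusters and the toy values are arbitrary choices for the example, not a prescribed rule.

```python
import numpy as np
from sklearn.cluster import KMeans

values = np.array([4, 5, 5, 6, 20, 21, 22, 95], dtype=float).reshape(-1, 1)

# Group similar values into clusters (k chosen by hand for this toy example).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(values)
labels = kmeans.labels_

# Flag values that end up in very small clusters as candidate outliers.
sizes = np.bincount(labels)
outliers = values.ravel()[sizes[labels] == 1]
print(outliers)  # 95 is expected to be isolated from the other clusters
```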
DATA INTEGRATION
Data integration is a data preprocessing technique that combines data from multiple sources and
provides users with a unified view of these data. These sources may include multiple databases, data cubes,
or flat files. One of the most well-known implementations of data integration is building an enterprise's data
warehouse, which enables a business to perform analyses based on the data it contains.
There are mainly two major approaches to data integration:
1. Tight Coupling
In tight coupling, data is combined from different sources into a single physical location through the
process of ETL - Extraction, Transformation and Loading.
2. Loose Coupling
In loose coupling, the data remains only in the actual source databases. In this approach, an interface is
provided that takes a query from the user, transforms it into a form the source databases can understand, and
then sends the query directly to the source databases to obtain the result.
Entity identification problem: e.g., one table has customer_id while another database table has cust_number as the attribute for the same entity.
Solution:
1) Refer to the metadata of each attribute, which includes the name, meaning, data type, and range of
permitted values.
2) Check that any attribute functional dependencies and referential constraints in the source system match those in
the target system.
Redundancy:
1) An attribute is redundant if it can be "derived" from another attribute or set of attributes.
2) Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set.
3) Some redundancies can be detected by correlation analysis.
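A short sketch of redundancy detection via correlation analysis, here using Pearson's correlation coefficient for two numeric attributes; the attribute names and values are made up for illustration.

```python
import numpy as np

# Two numeric attributes from an integrated data set (illustrative values).
annual_income = np.array([30, 45, 60, 75, 90], dtype=float)       # in $1000s
monthly_spend = np.array([2.1, 3.0, 4.2, 5.1, 6.0], dtype=float)  # in $1000s

# Pearson correlation coefficient: values near +1 or -1 suggest that one
# attribute can largely be derived from the other, i.e. it may be redundant.
r = np.corrcoef(annual_income, monthly_spend)[0, 1]
print(f"correlation = {r:.3f}")
if abs(r) > 0.9:  # the threshold is a judgment call, not a fixed rule
    print("attributes are highly correlated; one may be redundant")
```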
TUPLE DUPLICATION
In addition to detecting redundancies between attributes, duplication should also be detected at the tuple level,
where two or more identical tuples exist for a unique data entry case.
DATA REDUCTION
2. Dimensionality reduction:
When we come across data that is only weakly important, we keep only the attributes required
for our analysis. Dimensionality reduction reduces data size by eliminating outdated or redundant features.
1. Wavelet Transform
In the wavelet transform, a data vector X is transformed into a numerically different data vector X' such
that both X and X' are of the same length. How, then, is it useful in reducing data?
The data obtained from the wavelet transform can be truncated: a compressed approximation is obtained
by storing only a small fraction of the strongest wavelet coefficients.
Wavelet transforms can be applied to data cubes, sparse data, or skewed data.
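A hedged sketch of wavelet-based reduction, assuming the PyWavelets (pywt) package is available: the data vector is decomposed, all but the strongest coefficients are zeroed out (so only a small fraction needs to be stored), and an approximation is reconstructed.

```python
import numpy as np
import pywt  # PyWavelets, assumed to be installed

x = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])

# Haar wavelet decomposition of the data vector.
coeffs = pywt.wavedec(x, 'haar')
flat = np.concatenate(coeffs)

# Keep only the strongest coefficients (e.g. the top 50%), zeroing the rest.
threshold = np.quantile(np.abs(flat), 0.5)
truncated = [np.where(np.abs(c) >= threshold, c, 0.0) for c in coeffs]

# Reconstruct an approximation of the original vector from the kept coefficients.
x_approx = pywt.waverec(truncated, 'haar')
print(x_approx)
```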
Attribute subset selection (stepwise forward selection), for example: suppose there are the following attributes
in the data set, a few of which are redundant.
Initial attribute set: {X1, X2, X3, X4, X5, X6}
P-values: {0.7, 0.6, 0.1, 0.1, 0.5, 0.2}
Initial reduced attribute set: { }
Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}
Final reduced attribute set: {X1, X2, X5}
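The stepwise selection above can be sketched as a greedy loop in Python. The scores below simply mirror the values listed in the example and are treated as relevance scores (higher = more relevant); the rule of stopping after three attributes is an assumption made only to reproduce the steps shown.

```python
# A minimal greedy forward-selection sketch (scores and stopping rule are
# assumptions for illustration, mirroring the example above).
scores = {"X1": 0.7, "X2": 0.6, "X3": 0.1, "X4": 0.1, "X5": 0.5, "X6": 0.2}

selected = []
remaining = set(scores)
while remaining and len(selected) < 3:
    # At each step, add the most relevant attribute not yet selected.
    best = max(remaining, key=scores.get)
    selected.append(best)
    remaining.remove(best)
    print(f"Step-{len(selected)}: {selected}")
# Step-1: ['X1'], Step-2: ['X1', 'X2'], Step-3: ['X1', 'X2', 'X5']
```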
3. Data Compression:
The data compression technique reduces the size of files using different encoding mechanisms (e.g., Huffman
encoding and run-length encoding). We can divide it into two types based on the compression technique used.
Lossless Compression –
Encoding techniques such as run-length encoding allow a simple and modest reduction in data size. Lossless data compression
uses algorithms that restore the precise original data from the compressed data (a minimal run-length encoding sketch appears after this section).
Lossy Compression –
Methods such as the discrete wavelet transform and PCA (principal component analysis) are examples of this type of
compression. For example, the JPEG image format uses lossy compression, but we can still recover an image whose meaning is
equivalent to the original. In lossy data compression, the decompressed data may differ from the original data but are still
useful enough to retrieve information from.
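A minimal sketch of lossless run-length encoding, mentioned in the lossless compression item above; it stores each symbol together with the length of its run and can restore the exact original sequence.

```python
from itertools import groupby

def rle_encode(data: str) -> list[tuple[str, int]]:
    """Encode a string as (symbol, run_length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(data)]

def rle_decode(pairs: list[tuple[str, int]]) -> str:
    """Restore the exact original string from the pairs (lossless)."""
    return "".join(ch * count for ch, count in pairs)

encoded = rle_encode("AAAABBBCCD")
print(encoded)              # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
print(rle_decode(encoded))  # 'AAAABBBCCD'
```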
4. Numerosity Reduction:
In this reduction technique the actual data are replaced with a mathematical model or a smaller representation of the data.
For parametric methods, only the model parameters need to be stored instead of the actual data; non-parametric methods
include clustering, histograms, and sampling.
Parametric methods
Regression and log-linear models can be used to approximate the given data. In (simple)
linear regression, the data are modeled to fit a straight line. For example, a random
variable, y (called a response variable), can be modeled as a linear function of another
random variable, x (called a predictor variable), with the equation
y = wx + b
where the variance of y is assumed to be constant. In the context of data mining, x and y
are numeric database attributes. The coefficients, w and b (called regression coefficients),
specify the slope of the line and the y-intercept, respectively.
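A short sketch of parametric numerosity reduction with simple linear regression: instead of storing all (x, y) pairs, only the two regression coefficients w and b are kept. The data values are illustrative, and numpy.polyfit is used only as one convenient way to fit the line.

```python
import numpy as np

# Numeric attributes x (predictor) and y (response); values are illustrative.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# Fit y = w*x + b; only w and b need to be stored instead of the raw data.
w, b = np.polyfit(x, y, deg=1)
print(f"y = {w:.2f} * x + {b:.2f}")

# Any y value can later be approximated from its x value.
y_estimate = w * 3.5 + b
print(y_estimate)
```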
Non-parametric methods
clustering
Clustering techniques consider data tuples as objects. They partition the objects into
groups, or clusters, so that objects within a cluster are “similar” to one another and “dissimilar”
to objects in other clusters.
histogram
Histograms use binning to approximate data distributions and are a popular form
of data reduction. A histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, referred
to as buckets or bins.
If each bucket represents only a single attribute–value/frequency pair, the
buckets are called singleton buckets
example:
Histograms. The following data are a list of AllElectronics prices for commonly sold
items (rounded to the nearest dollar). The numbers have been sorted: 1, 1, 5, 5, 5,
5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18,
18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30,
30, 30.
Singleton buckets are useful for storing high-frequency outliers.
Equal-width: In an equal-width histogram, the width of each bucket range is
uniform (e.g., a width of $10 for each bucket; see the sketch after these definitions).
Equal-frequency (or equal-depth): In an equal-frequency histogram, the buckets are
created so that, roughly, the frequency of each bucket is constant (i.e., each bucket
contains roughly the same number of contiguous data samples)
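A sketch of building an equal-width histogram for the AllElectronics price list above with numpy; the bucket width of $10 matches the equal-width description, and the bucket edges are chosen by hand for this example.

```python
import numpy as np

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15,
          15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20,
          20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

# Equal-width buckets of width $10: 1-10, 11-20, 21-30.
counts, edges = np.histogram(prices, bins=[1, 11, 21, 31])
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo}, {hi}): {c} items")
```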
Sampling
Sampling can be used for data reduction because it allows a large data set to be represented by a much smaller random
data sample (or subset).
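A tiny sketch of simple random sampling without replacement using numpy, reducing a large attribute to a much smaller representative subset; the data and sampling fraction are arbitrary for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
data = np.arange(10_000)  # stand-in for a large attribute column

# Simple random sample without replacement (SRSWOR) of 1% of the tuples.
sample = rng.choice(data, size=len(data) // 100, replace=False)
print(len(sample), sample[:5])
```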
Difference between Dimensionality Reduction and Numerosity Reduction:
Dimensionality reduction: some data can be lost, but only data which is irrelevant. Numerosity reduction: there is less (or no) loss of data.
Dimensionality reduction: its components are feature selection and feature extraction. Numerosity reduction: it has no components, but uses methods such as regression, histograms, clustering, and sampling.
Dimensionality reduction: it leads to less misleading data and more accurate models. Numerosity reduction: it preserves the integrity of the data, and the data volume is also reduced.
4. Normalization, where the attribute data are scaled so as to fall within a smaller range,
such as -1.0 to 1.0, or 0.0 to 1.0.
5. Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by
interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior).
The labels, in turn, can be recursively organized into higher-level concepts, resulting
in a concept hierarchy for the numeric attribute. Figure 3.12 shows a concept hierarchy
for the attribute price. More than one concept hierarchy can be defined for the same
attribute to accommodate the needs of various users.
6. Concept hierarchy generation for nominal data, where attributes such as street can
be generalized to higher-level concepts, like city or country.
Top-down discretization
If the process starts by first finding one or a few points (called split points
or cut points) to split the entire attribute range, and then repeats this
recursively on the resulting intervals, then it is called top-down
discretization or splitting.
Bottom-up discretization
If the process starts by considering all of the continuous values as
potential split-points and removes some by merging neighboring values to
form intervals, then it is called bottom-up discretization or merging.
The measurement unit used can affect the data analysis. For example, changing measurement
units from meters to inches for height, or from kilograms to pounds for weight, may lead to very different results.
To help avoid dependence on the choice of measurement units, the
data should be normalized or standardized. This involves transforming the data to fall
within a smaller or common range such as [-1, 1] or [0.0, 1.0]. (The terms standardize
and normalize are used interchangeably in data preprocessing.)
Normalization
Normalization is particularly useful for classification algorithms involving neural networks or
distance measurements such as nearest-neighbor classification and clustering. If using
the neural network backpropagation algorithm for classification mining, normalizing the input
values for each attribute measured in the training tuples will help speed up the learning phase.
There are many methods for data normalization, including min-max normalization,
z-score normalization, and normalization by decimal scaling. In the following, let A be
a numeric attribute with n observed values, v1, v2, ..., vn.
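A minimal sketch of min-max and z-score normalization for a numeric attribute A with observed values v1..vn, matching the description above; the sample values and the target range [0, 1] for min-max are just examples.

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])  # observed values of attribute A

# Min-max normalization to [new_min, new_max] = [0.0, 1.0]:
#   v' = (v - min_A) / (max_A - min_A) * (new_max - new_min) + new_min
v_minmax = (v - v.min()) / (v.max() - v.min())

# Z-score normalization:
#   v' = (v - mean_A) / std_A
v_zscore = (v - v.mean()) / v.std()

print(v_minmax)
print(v_zscore)
```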
Data mining is defined as a process used to extract usable data from a larger set of raw data. It implies
analysing data patterns in large batches of data using one or more software tools. Data mining is also known as
Knowledge Discovery in Data (KDD).
Data mining is the process of finding anomalies, patterns, and correlations within large data sets to predict
outcomes. Using a broad range of techniques, you can use this information to increase revenues, cut costs,
improve customer relationships, and reduce risks.
The significant components of data mining systems are a data source, data mining engine, data warehouse server, the
pattern evaluation module, graphical user interface, and knowledge base.
Data Source:
The actual sources of data are databases, data warehouses, the World Wide Web (WWW), text files, and other documents. You
need a huge amount of historical data for data mining to be successful. Organizations typically store data in databases or
data warehouses. Data warehouses may comprise one or more databases, text files, spreadsheets, or other repositories of
data. Sometimes, even plain text files or spreadsheets may contain information. Another primary source of data is the
World Wide Web or the internet.
Before passing the data to the database or data warehouse server, the data must be cleaned, integrated, and selected. As
the information comes from various sources and in different formats, it can't be used directly for the data mining
procedure because the data may not be complete and accurate. So, the data first needs to be cleaned and unified. More
information than needed will be collected from various data sources, and only the data of interest has to be selected
and passed to the server. These procedures are not as easy as they sound; several methods may be performed on the data
as part of selection, integration, and cleaning.
Database or Data Warehouse Server:
The database or data warehouse server contains the original data that is ready to be processed. Hence, the server is
responsible for retrieving the relevant data, based on the user's data mining request.
Data Mining Engine:
The data mining engine is a major component of any data mining system. It contains several modules for performing data
mining tasks, including association, characterization, classification, clustering, prediction, and time-series analysis.
In other words, the data mining engine is the core of the data mining architecture. It comprises the instruments and software
used to obtain insights and knowledge from data collected from various data sources and stored within the data
warehouse.
Pattern Evaluation Module:
The pattern evaluation module is primarily responsible for measuring how interesting a discovered pattern is, using a threshold
value. It collaborates with the data mining engine to focus the search on interesting patterns.
This segment commonly employs interestingness measures that cooperate with the data mining modules to focus the search
towards interesting patterns. It might utilize an interestingness threshold to filter out discovered patterns. On the other hand, the
pattern evaluation module might be integrated with the mining module, depending on the implementation of the data
mining techniques used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as
deep as possible into the mining procedure so as to confine the search to only interesting patterns.
Graphical User Interface:
The graphical user interface (GUI) module communicates between the data mining system and the user. This module helps
the user use the system easily and efficiently without knowing the complexity of the process. It cooperates
with the data mining system when the user specifies a query or a task and displays the results.
Knowledge Base:
The knowledge base is helpful in the entire data mining process. It may be used to guide the search or to evaluate the
interestingness of the resulting patterns. The knowledge base may even contain user views and data from user experiences
that can be helpful in the data mining process. The data mining engine may receive inputs from the knowledge base to make
the results more accurate and reliable. The pattern evaluation module interacts with the knowledge base regularly to get
inputs and to update it.
DATA MINING AS A KDD PROCESS
KDD stands for Knowledge Discovery in Databases.
Data Characterization
For example, one may want to characterize the OurVideoStore customers who
regularly rent more than 30 movies a year. With concept hierarchies on the
attributes describing the target class, the attribute-oriented induction method can be
used, for example, to carry out data summarization. Note that with a data cube
containing a summarization of the data, simple OLAP operations fit the purpose of data
characterization.
Analytical characterization
Analytical characterization is used to help identify weakly relevant or irrelevant
attributes. We can exclude these unwanted, irrelevant attributes when preparing our data for
mining.
Data Collection
The data is collected for the target class and its contrasting class.
Preliminary relevance analysis with the help of conservative AOI
We need to decide on a set of dimensions and attributes and apply the selected relevance
measure to them. The relation obtained by such an application of Attribute-Oriented Induction
is called the candidate relation of the mining task.
For example, an initial attribute set {A, B, C, D, E, F, G, H} might be reduced to {A, B, E, F, H} after relevance analysis.
Relevance analysis to remove the irrelevant or weakly relevant attributes
This step consists of applying relevance analysis to remove the weakly relevant or irrelevant attributes.
Attribute Oriented Induction to generate the concepts
We then perform Attribute-Oriented Induction (AOI), an algorithm for data summarization; AOI can,
however, suffer from the problem of over-generalization. Data summarization is a data mining technique
with the help of which we can summarize large data sets into concise, understandable knowledge.
Relevance Measures
We can determine the classifying power of an attribute within a set of data with the help of a
quantitative relevance measure.
Some competing relevance measures are mentioned below:
Gini index
χ² contingency table statistics
Gain ratio (C4.5)
Uncertainty coefficient
Information gain (ID3)
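As a concrete illustration of one of these measures, the sketch below computes information gain (as used by ID3) for a single categorical attribute against a class label; the toy attribute and class values are invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Entropy of the whole set minus the weighted entropy after splitting
    on the attribute; a higher gain means a more relevant attribute."""
    total = len(labels)
    split_entropy = 0.0
    for value in set(attribute_values):
        subset = [l for a, l in zip(attribute_values, labels) if a == value]
        split_entropy += len(subset) / total * entropy(subset)
    return entropy(labels) - split_entropy

# Toy data: does the attribute "status" help predict the target class?
status = ["grad", "grad", "undergrad", "undergrad", "grad", "undergrad"]
target = ["yes", "yes", "no", "no", "yes", "yes"]
print(information_gain(status, target))
```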
From the data analysis point of view, we can classify data mining into the following two
categories: predictive data mining and descriptive data mining.
Concept Description
Concept description is the simplest kind of descriptive data mining. A concept is a term that
refers to a collection of data, for example new_students, graduate_students, alumni, and so on.
Concept description is not a simple enumeration of the data; the reason is that concept
description generates descriptions for the characterization and comparison of the data.
The term concept description is also referred to as class description, especially when the
concept to be described refers to a class of objects.
Comparison of data
• Comparison of data provides descriptions comparing two or more collections of data.
Characterization of data
Characterization provides a concise summary of the given collection of data.
Cross-Tabulation:
Mapping results into cross-tabulation form (similar to
contingency tables).
Visualization Techniques:
Pie charts, bar charts, curves, cubes, and other visual
forms.
Quantitative characteristic rules:
Mapping generalized results into characteristic rules
with quantitative information associated with them.
Data Cube Approach
It is nothing but performing computations and storing
results in data cubes.
Strength
An efficient implementation of data generalization.
Computation of various kinds of measures, e.g.,
count( ), sum( ), average( ), max( ).
Generalization and specialization can be performed
on a data cube by roll-up and drill-down.
Limitations
It handles only dimensions of simple non-numeric
data and measures of simple aggregated numeric values.
Lack of intelligent analysis: it cannot tell which
dimensions should be used or to what level the
generalization should reach.
Association Rule Mining
A subsequence,
such as buying first a PC, then a digital camera, and then a memory card, if it occurs frequently
in a shopping history database, is a (frequent) sequential pattern. If a substructure occurs frequently, it is
called a (frequent) structured pattern. Finding frequent patterns plays an essential role in
mining associations, correlations, and many other interesting relationships among data.
Frequent pattern mining searches for recurring relationships in a given data set.
Each item has a Boolean variable representing the presence or absence of that item. Each basket can then
be represented by a Boolean vector of values assigned to these variables. The Boolean
vectors can be analyzed for buying patterns that reflect items that are frequently associated
or purchased together. These patterns can be represented in the form of association rules.
For example, the information that customers who purchase computers also tend
to buy antivirus software at the same time is represented in the following association
rule:
computer ⇒ antivirus_software [support = 2%, confidence = 60%]
Rule support and confidence are two measures of rule interestingness. They respectively
reflect the usefulness and certainty of discovered rules. A support of 2% for
the rule above means that 2% of all the transactions under analysis show that computer
and antivirus software are purchased together.
A confidence of 60% means that 60% of
the customers who purchased a computer also bought the software. Association
rules are considered interesting if they satisfy both a minimum support threshold
and a minimum confidence threshold.
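A short sketch of computing support and confidence for a single rule over a toy transaction database; the transactions and the rule computer ⇒ antivirus_software are illustrative, so the resulting percentages will differ from the figures quoted above.

```python
transactions = [
    {"computer", "antivirus_software", "printer"},
    {"computer", "memory_card"},
    {"computer", "antivirus_software"},
    {"printer", "camera"},
    {"computer", "antivirus_software", "camera"},
]

antecedent = {"computer"}
consequent = {"antivirus_software"}

n = len(transactions)
both = sum(1 for t in transactions if antecedent | consequent <= t)
antecedent_only = sum(1 for t in transactions if antecedent <= t)

support = both / n                   # P(A and B)
confidence = both / antecedent_only  # P(B | A)
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```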
Example (Apriori algorithm, with minimum support count = 2):
Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset (this is candidate set C1).
(II) Compare each candidate item's support count with the minimum support count (here
min_support = 2); if the support count of a candidate item is less than min_support,
remove that item. This gives us itemset L1.
Step-2: K=2
(I) Generate candidate set C2 using L1 (this is called the join step). The condition for
joining Lk-1 and Lk-1 is that they should have (K-2) elements in common.
Check whether all subsets of an itemset are frequent or not, and if not frequent,
remove that itemset. (For example, the subsets of {I1, I2} are {I1} and {I2}; they are
frequent. Check this for each itemset.)
Now find the support count of these itemsets by searching the dataset.
(II) Compare the candidate set (C2) support count with the minimum support count (here
min_support = 2); if the support count of a candidate item is less than min_support,
remove those items. This gives us itemset L2.
For example, when later joining 2-itemsets that share (K-2) = 1 element:
{I1,I2} join {I1,I3} = {I1,I2,I3}, a frequent itemset
{I1,I2} join {I1,I5} = {I1,I2,I5}, a frequent itemset
{I1,I2} join {I2,I3} = {I1,I2,I3}
{I1,I2} join {I2,I4} = {I1,I2,I4}; its subset {I1,I4} is infrequent, which is why {I1,I2,I4} is removed
Candidates such as {I2,I3,I4} and {I2,I4,I5} are likewise removed because they contain infrequent subsets.
Step-3:
(I) Generate candidate set C3 using L2 (join step). The condition for joining Lk-1
and Lk-1 is that they should have (K-2) elements in common; so here, for L2, the
first element should match.
(II) Compare the candidate (C3) support count with the minimum support count (here
min_support = 2); if the support count of a candidate item is less than min_support,
remove those items. This gives us itemset L3.
Step-4:
Generate candidate set C4 using L3 (join step). The condition for joining Lk-1
and Lk-1 (K=4) is that they should have (K-2) elements in common. The resulting
candidate {I1,I2,I3,I5} is not frequent (it has an infrequent subset), so C4 is
empty and the algorithm stops.
Thus, we have discovered all the frequent itemsets. Now the generation of strong
association rules comes into the picture. For that we need to calculate the confidence of
each rule.
Confidence –
A confidence of 60% means that 60% of the customers who purchased milk and
bread also bought butter.
Confidence(A->B)=Support_count(A∪B)/Support_count(A)
So here, taking one frequent itemset as an example, we will show the rule generation.
Itemset {I1, I2, I3} // from L3
The rules can be:
[I1^I2] => [I3] // confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4*100 = 50%
[I1^I3] => [I2] // confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4*100 = 50%
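A hedged sketch of this rule-generation step: given a frequent itemset and the support counts of its subsets, enumerate candidate rules and keep those meeting a minimum confidence. Only the counts for {I1,I2,I3}, {I1,I2}, and {I1,I3} are taken from the worked example above; the remaining counts are assumed placeholders and should come from the actual dataset.

```python
from itertools import combinations

# Support counts: values marked "assumed" are not given in the text above
# and are placeholders for illustration only.
support_count = {
    frozenset({"I1", "I2", "I3"}): 2,  # from the example
    frozenset({"I1", "I2"}): 4,        # from the example
    frozenset({"I1", "I3"}): 4,        # from the example
    frozenset({"I2", "I3"}): 4,        # assumed
    frozenset({"I1"}): 6,              # assumed
    frozenset({"I2"}): 7,              # assumed
    frozenset({"I3"}): 6,              # assumed
}

def rules_from_itemset(itemset, min_confidence=0.5):
    """Enumerate A => (itemset - A) for every non-empty proper subset A and
    keep rules whose confidence = sup(itemset) / sup(A) meets the threshold."""
    itemset = frozenset(itemset)
    strong = []
    for r in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), r):
            a = frozenset(antecedent)
            conf = support_count[itemset] / support_count[a]
            if conf >= min_confidence:
                strong.append((set(a), set(itemset - a), conf))
    return strong

for a, b, conf in rules_from_itemset({"I1", "I2", "I3"}, min_confidence=0.5):
    print(f"{sorted(a)} => {sorted(b)}  confidence = {conf:.0%}")
```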
Association rules generated from mining data at multiple levels of abstraction are called multiple-level or multilevel
association rules.
Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework.
Rules at a high concept level may amount to common-sense knowledge, while rules at a low concept level may not always be useful.
o Using uniform minimum support for all levels:
When a uniform minimum support threshold is used, the search procedure is simplified.
The method is also simple, in that users are required to specify only one minimum support threshold.
The same minimum support threshold is used when mining at each level of abstraction.
For example, a minimum support threshold of 5% may be used throughout
(e.g., for mining from "computer" down to "laptop computer").
Both “computer” and “laptop computer” are found to be frequent, while “desktop computer” is not.
Using reduced minimum support at lower levels:
o Each level of abstraction has its own minimum support threshold.
o The deeper the level of abstraction, the smaller the corresponding threshold is.
o For example, the minimum support thresholds for levels 1 and 2 might be 5% and 3%, respectively.
o In this way, “computer,” “laptop computer,” and “desktop computer” are all considered frequent.
[Figure omitted: a concept hierarchy with "all" at the top and items such as laptop and desktop computers, office and anti-virus software, printer, and camera at the lower levels.]
The minimum support threshold at the highest level of abstraction is higher than the minimum support thresholds
at the lower levels. Association rules can be extracted only when the concept hierarchy is combined with the
support framework. If a single, larger support count were used, only items at the highest level would qualify
for rule formation.
Mining multidimensional association rules, that is, rules involving more than one dimension or
predicate (e.g., rules relating what a customer buys as well as the customer's age). These methods can be
organized according to their treatment of quantitative attributes.
Multi-Dimensional Association
• Single-dimensional rules, e.g.:
buys(X, "milk") ⇒ buys(X, "bread")
• Multi-dimensional rules involve two or more dimensions or predicates, e.g.:
age(X, "20..29") ∧ occupation(X, "student") ⇒ buys(X, "laptop")
• Search for frequent k-predicate sets. Example: {age, occupation, buys} is a 3-predicate set.
• Techniques can be categorized by how quantitative attributes, such as age, are treated.
3. Distance‐based association rules – This is a dynamic discretization process that considers the
distance between data points.
[Figure omitted: the lattice of predicate sets over the dimensions age, income, and buys, from the empty set through {age}, {income}, {buys} and the pairwise sets up to {age, income, buys}.]
2. Using dynamic discretization of quantitative attributes: