For many applications, it is difficult to find strong associations among data items at low or primitive
levels of abstraction due to the sparsity of data at those levels. Strong associations discovered at high
levels of abstraction may represent commonsense knowledge.
Therefore, data mining systems should provide capabilities for mining association rules at multiple levels
of abstraction, with sufficient flexibility for easy traversal among different abstraction spaces.
Mining multilevel association rules: Suppose we are given the task-relevant set of transactional data shown in the Table for sales in an All Electronics store, which lists the items purchased in each transaction.
The concept hierarchy for the items is shown in Figure. A concept hierarchy defines a sequence of
mappings from a set of low-level concepts to higher level, more general concepts. Data can be
generalized by replacing low-level concepts within the data by their higher-level concepts, or ancestors,
from a concept hierarchy.
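The generalization step can be sketched in a few lines. The hierarchy below is a toy child-to-parent mapping; the item names are illustrative, not taken from the original figure.

```python
# A toy concept hierarchy, stored as child -> parent; the item names are
# illustrative, not taken from the original figure.
hierarchy = {
    "laptop computer": "computer",
    "desktop computer": "computer",
    "computer": "electronics",
    "inkjet printer": "printer",
    "printer": "electronics",
}

def ancestors(item):
    """Walk up the hierarchy, returning ancestors from lowest to highest."""
    chain = []
    while item in hierarchy:
        item = hierarchy[item]
        chain.append(item)
    return chain

def generalize(transaction):
    """Replace each item by its immediate parent concept, if it has one."""
    return {hierarchy.get(item, item) for item in transaction}
```

For example, `generalize({"laptop computer", "inkjet printer"})` replaces both low-level items by their ancestors, yielding the higher-level transaction `{"computer", "printer"}`.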
Association rules generated from mining data at multiple levels of abstraction are called multiple-level
or multilevel association rules. Multilevel association rules can be mined efficiently using concept
hierarchies under a support-confidence framework. In general, a top-down strategy is employed: counts are accumulated for the calculation of frequent itemsets at each concept level, starting at level 1 and working downward toward the more specific levels. At each level, any algorithm for discovering frequent itemsets may be used, such as Apriori or one of its variations.
Using uniform minimum support for all levels (referred to as uniform support): The same minimum
support threshold is used when mining at each level of abstraction. For example, in Figure 5.11, a
minimum support threshold of 5% is used throughout (e.g., for mining from “computer” down to “laptop
computer”). Both “computer” and “laptop computer” are found to be frequent, while “desktop
computer” is not.
When a uniform minimum support threshold is used, the search procedure is simplified. The method is
also simple in that users are required to specify only one minimum support threshold. An Apriori-like
optimization technique can be adopted, based on the knowledge that an ancestor is a superset of its
descendants: The search avoids examining itemsets containing any item whose ancestors do not have
minimum support.
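The uniform-support strategy with its ancestor-based pruning can be sketched as follows. The transaction counts are made up so that "computer" and "laptop computer" come out frequent while "desktop computer" does not, mirroring the example above.

```python
from collections import Counter

# Made-up data: 100 transactions, laptops in 6, desktops in 3, so under a
# uniform 5% threshold "laptop computer" is frequent but "desktop
# computer" is not, as in the text's example.
transactions = ([{"laptop computer"}] * 6 + [{"desktop computer"}] * 3
                + [{"flash drive"}] * 91)
hierarchy = {"laptop computer": "computer", "desktop computer": "computer"}
min_sup = 0.05

def frequent(items_per_txn, allowed=None):
    """Return the items whose relative support reaches min_sup."""
    counts = Counter()
    for txn in items_per_txn:
        for item in txn:
            if allowed is None or item in allowed:
                counts[item] += 1
    n = len(items_per_txn)
    return {i for i, c in counts.items() if c / n >= min_sup}

# Level 1: mine at the ancestor level.
level1 = frequent([{hierarchy.get(i, i) for i in t} for t in transactions])
# Level 2: Apriori-like pruning -- only examine items whose ancestor is frequent.
allowed = {i for t in transactions for i in t if hierarchy.get(i) in level1}
level2 = frequent(transactions, allowed=allowed)
```

Here `level1` contains "computer" (9% support), and `level2` contains only "laptop computer" (6%), since "desktop computer" (3%) falls below the uniform threshold.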
Using reduced minimum support at lower levels (referred to as reduced support): Each level of
abstraction has its own minimum support threshold. The deeper the level of abstraction, the smaller the
corresponding threshold is. For example, in Figure, the minimum support thresholds for levels 1 and 2
are 5% and 3%, respectively. In this way, “computer,” “laptop computer,” and “desktop computer” are
all considered frequent.
Because users or experts often have insight as to which groups are more important than others, it is sometimes more desirable to set up user-specific, item-based, or group-based minimum support thresholds when mining multilevel rules. For example, a user could set up the minimum support thresholds based
on product price, or on items of interest, such as by setting particularly low support thresholds for
laptop computers and flash drives in order to pay particular attention to the association patterns
containing items in these categories.
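A group-based threshold scheme is simple to express: items in groups the user flags as interesting get a lower threshold. The item names and the 1%/5% values below are assumptions for illustration.

```python
# Groups of special interest get a low 1% threshold; everything else
# falls back to the default 5%. All values here are illustrative.
group_min_sup = {"laptop computer": 0.01, "flash drive": 0.01}
default_min_sup = 0.05

def is_frequent(item, support):
    """Compare an item's support against its group-specific threshold."""
    return support >= group_min_sup.get(item, default_min_sup)
```

With this setup, a laptop-computer itemset with 2% support is kept, while an ordinary item at the same support is discarded.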
2. Mining Multidimensional Association Rules from Relational Databases and Data Warehouses
We have studied association rules that imply a single predicate, that is, the predicate buys. For instance,
in mining our All Electronics database, we may discover the Boolean association rule
Following the terminology used in multidimensional databases, we refer to each distinct predicate in a
rule as a dimension. Hence, we can refer to the rule above as a single-dimensional or intradimensional
association rule because it contains a single distinct predicate (e.g., buys) with multiple occurrences (i.e.,
the predicate occurs more than once within the rule). As we have seen in the previous sections of this
chapter, such rules are commonly mined from transactional data.
Considering each database attribute or warehouse dimension as a predicate, we can therefore mine
association rules containing multiple predicates, such as
Association rules that involve two or more dimensions or predicates can be referred to as
multidimensional association rules. The rule above contains three predicates (age, occupation, and buys),
each of which occurs only once in the rule. Hence, we say that it has no repeated predicates.
Multidimensional association rules with no repeated predicates are called interdimensional association
rules. We can also mine multidimensional association rules with repeated predicates, which contain
multiple occurrences of some predicates. These rules are called hybrid-dimensional association rules. An
example of such a rule is the following, where the predicate buys is repeated:
Note that database attributes can be categorical or quantitative. Categorical attributes have a finite
number of possible values, with no ordering among the values (e.g., occupation, brand, color).
Categorical attributes are also called nominal attributes, because their values are "names of things."
Quantitative attributes are numeric and have an implicit ordering among values (e.g., age, income, and
price). Techniques for mining multidimensional association rules can be categorized into two basic
approaches regarding the treatment of quantitative attributes.
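One common treatment is to regard each attribute-value pair as a single "item," after which any frequent-itemset algorithm can be reused. The attribute names and values below are illustrative, not from the original database.

```python
# Each relational tuple becomes a set of "predicate=value" items, so that
# ordinary frequent-itemset mining yields multidimensional rules.
rows = [
    {"age": "20..29", "occupation": "student", "buys": "laptop"},
    {"age": "20..29", "occupation": "student", "buys": "laptop"},
    {"age": "30..39", "occupation": "engineer", "buys": "printer"},
]

def to_items(row):
    """Turn one relational tuple into a set of attribute=value items."""
    return {f"{attr}={val}" for attr, val in row.items()}

itemized = [to_items(r) for r in rows]
# A frequent itemset such as {'age=20..29', 'occupation=student',
# 'buys=laptop'} then corresponds to an interdimensional rule:
#   age(X, "20..29") AND occupation(X, "student") => buys(X, "laptop")
```

Quantitative attributes such as age are assumed to have been discretized into ranges (e.g., "20..29") before this step.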
CONSTRAINT-BASED ASSOCIATION RULES:
A data mining process may uncover thousands of rules from a given set of data, most of which
end up being unrelated or uninteresting to the users.
Often, users have a good sense of which “direction” of mining may lead to interesting patterns
and the “form” of the patterns or rules they would like to find.
Thus, a good heuristic is to have the users specify such intuition or expectations as constraints to
confine the search space.
o Knowledge type constraints: These specify the type of knowledge to be mined, such as
association or correlation.
o Dimension/level constraints: These specify the desired dimensions (or attributes) of the
data, or levels of the concept hierarchies, to be used in mining.
o Rule constraints: These specify the form of rules to be mined. Such constraints may be
expressed as rule templates, as the maximum or minimum numbers of predicates that
can occur in the rule antecedent or consequent, or as relationships among attributes,
attribute values, and/or aggregates. The above constraints can be specified using a high-
level declarative data mining query language and user interface.
Constraint-based association rules: To make the mining process more efficient, constraint-based rule mining:
o allows users to describe the rules that they would like to uncover;
o provides a sophisticated mining query optimizer that can exploit the constraints specified by the user;
o encourages interactive exploratory mining and analysis.
Given a frequent pattern mining query with a set of constraints C, the algorithm should be:
o Sound: it only finds frequent sets that satisfy the given constraints C.
o Complete: all frequent sets satisfying the given constraints are found.
A naïve solution:
o Find all frequent sets and then test them for constraint satisfaction.
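The naive find-then-filter approach can be sketched directly. The constraint used here (at most two items, must include "computer") is a hypothetical rule constraint chosen only for illustration; the transactions are made up.

```python
from collections import Counter
from itertools import combinations

# Naive approach: enumerate all frequent itemsets first, then discard
# those violating the constraints C.
transactions = [
    {"computer", "printer"},
    {"computer", "printer", "flash drive"},
    {"computer", "flash drive"},
    {"printer"},
]
min_count = 2  # absolute minimum support count

def all_frequent_itemsets(txns, max_size=3):
    """Count every itemset up to max_size and keep the frequent ones."""
    counts = Counter()
    for t in txns:
        for k in range(1, max_size + 1):
            for combo in combinations(sorted(t), k):
                counts[combo] += 1
    return {s for s, c in counts.items() if c >= min_count}

def satisfies_c(itemset):
    """Hypothetical rule constraint: <= 2 items, must mention 'computer'."""
    return len(itemset) <= 2 and "computer" in itemset

result = {s for s in all_frequent_itemsets(transactions) if satisfies_c(s)}
# Sound: only sets satisfying C survive the filter.
# Complete: every frequent set was generated before filtering --
# at the cost of wasted work on sets that C would have pruned early.
```

Pushing the constraint into the itemset generation itself, rather than filtering afterwards, is what the mining query optimizer mentioned above aims to do.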
There are two forms of data analysis that can be used to extract models describing important classes
or to predict future data trends. These two forms are as follows −
Classification
Prediction
Classification models predict categorical class labels, whereas prediction models predict continuous-valued
functions. For example, we can build a classification model to categorize bank loan applications as either
safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on
computer equipment given their income and occupation.
What is classification?
Following are the examples of cases where the data analysis task is Classification −
A bank loan officer wants to analyze the data in order to know which customers (loan applicants)
are risky and which are safe.
A marketing manager at a company needs to predict whether a customer with a given profile will
buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the categorical labels.
These labels are risky or safe for loan application data and yes or no for marketing data.
What is prediction?
Following are the examples of cases where the data analysis task is Prediction −
Suppose the marketing manager needs to predict how much a given customer will spend during a sale at
his company. In this example we are asked to predict a numeric value; therefore, the data analysis
task is an example of numeric prediction. In this case, a model or predictor will be constructed that
predicts a continuous-valued function, or ordered value.
Note − Regression analysis is a statistical methodology that is most often used for numeric prediction.
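Since regression is named as the usual tool for numeric prediction, a minimal one-variable least-squares fit makes the idea concrete. The income/expenditure pairs below are made up for illustration.

```python
# Made-up training data: income (in $1000s) vs. expenditure (in $1000s).
incomes = [30, 40, 50, 60, 70]
spend = [1.0, 1.4, 1.8, 2.2, 2.6]

# Ordinary least squares for y = intercept + slope * x.
n = len(incomes)
mean_x = sum(incomes) / n
mean_y = sum(spend) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(incomes, spend))
         / sum((x - mean_x) ** 2 for x in incomes))
intercept = mean_y - slope * mean_x

def predict(income):
    """Numeric prediction: a continuous value, not a class label."""
    return intercept + slope * income

print(round(predict(55), 2))  # -> 2.0
```

Because the toy data are perfectly linear, the fitted line (slope 0.04, intercept -0.2) reproduces them exactly; real data would leave residual error.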
How Does Classification Work?
With the help of the bank loan application that we have discussed above, let us understand the working
of classification. The Data Classification process includes two steps −
The classifier is built from a training set made up of database tuples and their associated class
labels.
Each tuple that constitutes the training set is assumed to belong to a predefined class; the tuples
can also be referred to as samples, objects, or data points.
In this step, the classifier is used for classification. Here, test data are used to estimate the accuracy of
the classification rules. If the accuracy is considered acceptable, the rules can be applied to new data
tuples.
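The two-step process can be sketched end to end. The classifier here is a simple one-nearest-neighbour rule, and the (income, debt) attributes and labels are made up; the loan example itself only specifies the "risky"/"safe" labels.

```python
# Step 1: "build" the model from labeled training tuples. For a
# nearest-neighbour rule the model is just the training set itself.
train = [((20, 50), "risky"), ((25, 40), "risky"),
         ((60, 10), "safe"), ((80, 5), "safe")]
# Held-out test tuples with known labels, used to estimate accuracy.
test = [((22, 45), "risky"), ((70, 8), "safe")]

def classify(x):
    """Predict the label of the training tuple closest to x."""
    def sq_dist(a):
        return sum((u - v) ** 2 for u, v in zip(a, x))
    return min(train, key=lambda pair: sq_dist(pair[0]))[1]

# Step 2: estimate accuracy on the test set; if acceptable, the
# classifier may be applied to new, unlabeled tuples.
accuracy = sum(classify(x) == label for x, label in test) / len(test)
```

On this tiny example both test tuples are classified correctly, so the estimated accuracy is 1.0; a realistic test set would give a less optimistic figure.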
Classification and Prediction Issues
The major issue is preparing the data for Classification and Prediction. Preparing the data involves the
following activities −
Data Cleaning − Data cleaning involves removing noise and treating missing values. Noise is
removed by applying smoothing techniques, and missing values are handled by replacing a
missing value with the most commonly occurring value for that attribute.
Relevance Analysis − The database may also have irrelevant attributes. Correlation analysis is
used to know whether any two given attributes are related.
Data Transformation and Reduction − The data can be transformed by methods such as
normalization (scaling attribute values to fall within a small specified range) and generalization
(replacing low-level data with higher-level concepts).
Note − Data can also be reduced by some other methods such as wavelet transformation, binning,
histogram analysis, and clustering.
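Min-max normalization, one of the standard transformation methods, is a one-liner; the values below are arbitrary.

```python
def min_max(values):
    """Scale a list of numbers linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max([20, 40, 60, 100]))  # -> [0.0, 0.25, 0.5, 1.0]
```

Variants scale into an arbitrary range [new_min, new_max] by multiplying by (new_max - new_min) and adding new_min.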
Here are the criteria for comparing the methods of Classification and Prediction −
Accuracy − The accuracy of a classifier refers to its ability to predict the class label correctly;
the accuracy of a predictor refers to how well it can guess the value of the predicted attribute
for new data.
Speed − This refers to the computational cost of generating and using the classifier or predictor.
Robustness − This refers to the ability of the classifier or predictor to make correct predictions
from given noisy data.
Scalability − This refers to the ability to construct the classifier or predictor efficiently given
large amounts of data.
Classification and prediction are two forms of data analysis that can be used to extract models
describing important data classes or to predict future data trends.
Such analysis can help to provide us with a better understanding of the data at large.
The goal of data classification is to organize and categorize data in distinct classes.
The goal of prediction is to forecast or deduce the value of an attribute based on values of other
attributes.
Classification:
Suppose from your past data (training data) you come to know that your best friend likes the
movies listed above.
Now a new movie (test data) is released, and you want to know whether your best friend will
like it or not.
If you are strongly confident about the chances of your friend liking that movie, you can take
your friend to the movie this weekend.
If you observe the problem clearly, it is simply whether your friend will like the movie or not.
Finding a solution to this type of problem is called classification, because we are classifying
things according to where they belong (yes or no, like or dislike).
Keep in mind that here we are forecasting a discrete value (classification), and that
classification belongs to supervised learning.
This is because you are learning it from your training data.
Most classification is binary classification, in which we have to predict whether the output
belongs to class 1 or class 2 (class 1: yes, class 2: no).
We can also use classification to predict more than two classes, such as colors (RED, GREEN,
BLUE, YELLOW, ORANGE).
Prediction:
Suppose from your past data (training data) you come to know that your best friend liked the
movies above, and you also know how many times your friend watched each movie.
Now a new movie (test data) is released, and you want to find how many times your friend will
watch it: 5 times, 6 times, 10 times, or any other number.
If you observe the problem clearly, it is about finding a count; sometimes we can describe this
as predicting a value.
Keep in mind that here we are forecasting a continuous value (prediction), and that prediction
also belongs to supervised learning.
This is because you are learning it from your training data.