
Unit-2

Questions and Answers


1a) Define frequent itemset?
Ans: A frequent itemset typically refers to a set of items that often appear together in a
transactional data set—for example, milk and bread, which are frequently bought together
in grocery stores by many customers.
1b) Show how to compute the confidence of an association rule. Give an example.
Ans: confidence(A⇒B) = P(B|A) = support_count(A∪B) / support_count(A)

Example:
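A minimal sketch of the computation, assuming a hypothetical four-transaction data set (the
baskets and the helper names support and confidence are illustrative, not taken from the
question). Milk and bread occur together in 3 of 4 baskets, and every basket with milk also
contains bread.

```python
# Hypothetical transactions; each basket is a set of items.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "bread", "eggs"},
    {"bread", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """confidence(A => B) = support(A u B) / support(A)."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

print(support({"milk", "bread"}, transactions))      # 0.75
print(confidence({"milk"}, {"bread"}, transactions))  # 1.0
```

The rule Milk ⇒ Bread therefore has support 75% and confidence 100% on this sample data.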

1c) Explain in detail about frequent pattern mining in data mining.


Ans: Frequent patterns are patterns that appear frequently in a data set.
• There are many kinds of frequent patterns, including frequent itemsets, frequent
subsequences (also known as sequential patterns), and frequent substructures.
• A frequent itemset typically refers to a set of items that often appear together in a
transactional data set—for example, milk and bread, which are frequently bought
together in grocery stores by many customers.
• A frequently occurring subsequence, such as the pattern that customers tend to
purchase first a laptop, followed by a digital camera, and then a memory card, is a
(frequent) sequential pattern.
• A substructure can refer to different structural forms, such as subgraphs, subtrees, or
sublattices, which may be combined with itemsets or subsequences. If a substructure
occurs frequently, it is called a (frequent) structured pattern.
Market Basket Analysis:
• A typical example of frequent itemset mining is market basket analysis.
• This process analyzes customer buying habits by finding associations between the
different items that customers place in their “shopping baskets”.
• The discovery of such associations can help retailers develop marketing strategies by
gaining insight into which items are frequently purchased together by customers.


• For instance, if customers are buying milk, how likely are they to also buy bread on
the same trip?
• Such information can lead to increased sales by helping retailers do selective
marketing and plan their shelf space.
Association rules:
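• Buys(X, “Milk”) ⇒ Buys(X, “Bread”) [Support = 75%, Confidence = 100%]
• support(A⇒B) = P(A∪B)
• confidence(A⇒B) = P(B|A)
• Here, a support of 75% means that 75% of all transactions contain both milk and bread,
and a confidence of 100% means that every customer who bought milk also bought bread.
• If the relative support of an itemset I satisfies a prespecified minimum support
threshold, then I is a frequent itemset.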
In general, association rule mining can be viewed as a two-step process:
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as
frequently as a predetermined minimum support count, min_sup.
2. Generate strong association rules from the frequent itemsets: By definition, these
rules must satisfy minimum support and minimum confidence.

2a) Define maximal frequent itemset.


Ans: Maximal frequent itemset (or max-itemset): An itemset is maximal frequent if it is
frequent and none of its immediate supersets is frequent.
Ex: {A,B}=1, {A,C}=1, {A}=2. If the minimum support count is 2, then {A,B} and {A,C}, the
immediate supersets of {A}, are not frequent, while {A} itself is frequent. Hence {A} is a
maximal frequent itemset.
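The definition can be checked mechanically. The sketch below assumes the support counts are
already known and stored in a dictionary; the function name maximal_frequent is illustrative.
It tests against all frequent proper supersets, which gives the same result as testing only
immediate supersets when the table of counts is complete.

```python
def maximal_frequent(support_counts, min_count):
    """Return the frequent itemsets that have no frequent proper superset."""
    frequent = {s for s, n in support_counts.items() if n >= min_count}
    # For frozensets, s < other means "s is a proper subset of other".
    return {s for s in frequent if not any(s < other for other in frequent)}

# Support counts from the example above.
support_counts = {
    frozenset({"A"}): 2,
    frozenset({"A", "B"}): 1,
    frozenset({"A", "C"}): 1,
}

print(maximal_frequent(support_counts, min_count=2))  # {frozenset({'A'})}
```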
2b) Describe association rule mining.
Ans: In general, association rule mining can be viewed as a two-step process:
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least
as frequently as a predetermined minimum support count, min_sup.
2. Generate strong association rules from the frequent itemsets: By definition, these
rules must satisfy minimum support and minimum confidence.
2c) Give an overview of correlation analysis.
Ans: Correlation analysis is a statistical method used to measure the strength of the linear
relationship between two variables and compute their association. Correlation analysis
calculates the level of change in one variable due to the change in the other. A high correlation
points to a strong relationship between the two variables, while a low correlation means that
the variables are weakly related.
As the correlation coefficient value goes towards 0, the relationship between the two variables
will be weaker. The coefficient sign indicates the direction of the relationship; a + sign indicates
a positive relationship, and a - sign indicates a negative relationship.
Lift is a simple correlation measure. The occurrence of itemset A is independent of the
occurrence of itemset B if P(A∪B) = P(A)P(B); otherwise, itemsets A and B are dependent and
correlated as events. This definition can easily be extended to more than two itemsets. The
lift between the occurrence of A and B is measured by computing

lift(A, B) = P(A∪B) / (P(A)P(B))

A lift below 1 means A and B are negatively correlated, a lift above 1 means they are
positively correlated, and a lift equal to 1 means they are independent.


The second correlation measure is the χ² measure:
χ² = Σ (observed − expected)² / expected
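Both measures can be computed from four counts: the number of transactions n, the number
containing A, the number containing B, and the number containing both. The sketch below is a
minimal illustration; the counts (10,000 transactions, 6,000 with A, 7,500 with B, 4,000 with
both) are assumed sample values, and the function names are illustrative.

```python
def lift(n_ab, n_a, n_b, n):
    """lift(A, B) = P(A u B) / (P(A) * P(B)), computed from raw counts."""
    return (n_ab / n) / ((n_a / n) * (n_b / n))

def chi_square(n_ab, n_a, n_b, n):
    """Chi-square over the 2x2 contingency table of A and B."""
    # Observed cells: (A, B), (A, not B), (not A, B), (not A, not B).
    observed = [n_ab, n_a - n_ab, n_b - n_ab, n - n_a - n_b + n_ab]
    # Expected cells under independence: row total * column total / n.
    expected = [
        n_a * n_b / n,
        n_a * (n - n_b) / n,
        (n - n_a) * n_b / n,
        (n - n_a) * (n - n_b) / n,
    ]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Assumed sample counts.
print(round(lift(4000, 6000, 7500, 10000), 2))        # 0.89
print(round(chi_square(4000, 6000, 7500, 10000), 1))  # 555.6
```

With these counts the lift is below 1 and the observed joint count (4,000) is below the
expected 4,500, so A and B are negatively correlated in this sample.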
2d) Explain the measures of association rule mining.
Ans: The measures of association rule mining are support and confidence.
Association rules:
• Buys(X, “I1”) ⇒ Buys(X, “I2”) [Support = 44%, Confidence = 66%]
• support(A⇒B) = P(A∪B)
• confidence(A⇒B) = P(B|A)
• If the relative support of an itemset I satisfies a prespecified minimum support
threshold, then I is a frequent itemset.
Example:

3a) Describe maximal frequent itemset.


Ans: Maximal frequent itemset (or max-itemset): An itemset is maximal frequent if it is
frequent and none of its immediate supersets is frequent.
Ex: {A,B}=1, {A,C}=1, {A}=2. If the minimum support count is 2, then {A,B} and {A,C}, the
immediate supersets of {A}, are not frequent, while {A} itself is frequent. Hence {A} is a
maximal frequent itemset.
3b) Differentiate frequent subsequence and frequent substructure.
Ans: A frequently occurring subsequence, such as the pattern that customers tend to purchase
first a laptop, followed by a digital camera, and then a memory card, is a (frequent) sequential
pattern.
A substructure can refer to different structural forms, such as subgraphs, subtrees, or
sublattices, which may be combined with itemsets or subsequences. If a substructure occurs
frequently, it is called a (frequent) structured pattern.

3c) Compute all the frequent item sets using Apriori algorithm for the given data where
min-sup = 2.

Ans:
Therefore, the itemsets in L1, L2, and L3 are the frequent itemsets.
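The Apriori steps (count the candidates, keep those that meet min_sup, join, prune) can be
sketched as follows. The nine transactions below are an assumed sample data set and may differ
from the table given in the question; the function name apriori is illustrative.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: support_count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Scan the data to count each candidate k-itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}  # L_k
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent

# Assumed sample data (not the table from the question).
sample = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]
frequent = apriori(sample, min_count=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

On this sample, L1 has five itemsets, L2 has six, and L3 has two: {I1, I2, I3} and
{I1, I2, I5}, each with support count 2.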
4a) Define Support of an association rule.
Ans: Support measures how often an itemset appears in a data set. For a rule A⇒B, it is the
fraction of transactions that contain both A and B:
support(A⇒B) = P(A∪B)
• Buys(X, “Milk”) ⇒ Buys(X, “Bread”) [Support = 75%, Confidence = 100%]

4b) Define the two-step process of association rule mining.


Ans: In general, association rule mining can be viewed as a two-step process:
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least
as frequently as a predetermined minimum support count, min_sup.
2. Generate strong association rules from the frequent itemsets: By definition, these
rules must satisfy minimum support and minimum confidence.
4c) Apply FP-Growth algorithm to the following data for finding frequent item sets,
consider support threshold as 30%.

TID   List of ItemIDs
1     I1, I2, I4, I5
2     I2, I4, I7
3     I2, I3, I4, I5
4     I1, I3, I4, I7
5     I1, I2, I3, I4, I5
6     I3, I4, I5, I6

Ans:
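A sketch of the first scan of FP-growth on the table above: convert the 30% threshold to a
count, count each item, drop infrequent items, and rewrite every transaction in descending
order of support. Ties are broken alphabetically here, which is an assumed convention; the
variable names are illustrative.

```python
import math
from collections import Counter

transactions = {
    1: ["I1", "I2", "I4", "I5"],
    2: ["I2", "I4", "I7"],
    3: ["I2", "I3", "I4", "I5"],
    4: ["I1", "I3", "I4", "I7"],
    5: ["I1", "I2", "I3", "I4", "I5"],
    6: ["I3", "I4", "I5", "I6"],
}

# 30% of 6 transactions is 1.8, so an item needs at least 2 occurrences.
min_count = math.ceil(0.30 * len(transactions))

counts = Counter(item for items in transactions.values() for item in items)
frequent = {item: n for item, n in counts.items() if n >= min_count}

# Descending support count, ties broken by item name.
order = sorted(frequent, key=lambda item: (-frequent[item], item))
ordered = {tid: [i for i in order if i in items] for tid, items in transactions.items()}

print(min_count)  # 2
print(order)      # ['I4', 'I2', 'I3', 'I5', 'I1', 'I7']
for tid, items in ordered.items():
    print(tid, items)
```

The item counts are I4 = 6, I2 = 4, I3 = 4, I5 = 4, I1 = 3, I7 = 2, and I6 = 1, so I6 is
dropped. The ordered transactions are the input from which the FP-tree is built.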
4d) Explain in detail about multilevel association rules.
Ans: Association rules generated from mining data at multiple levels of abstraction are called
multiple-level or multilevel association rules. Multilevel association rules can be mined
efficiently using concept hierarchies under a support-confidence framework.
i) Using uniform minimum support for all levels (referred to as uniform support):
• The same minimum support threshold is used when mining at each level of abstraction.
• For example, a minimum support threshold of 5% is used throughout (e.g., for mining
from “computer” down to “laptop computer”). Both “computer” and “laptop computer”
are found to be frequent, while “desktop computer” is not.

ii) Using reduced minimum support at lower levels (referred to as reduced support):
• Each level of abstraction has its own minimum support threshold.
• The deeper the level of abstraction, the smaller the corresponding threshold.
iii) Using item or group-based minimum support (referred to as group-based support):
• Because users or experts often have insight as to which groups are more important
than others, it is sometimes more desirable to set up user-specific, item-based, or
group-based minimum support thresholds when mining multilevel rules.
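The difference between uniform and reduced support can be illustrated with a small sketch.
The support values below (10%, 6%, 4%) and the 3% lower-level threshold are assumptions chosen
to match the computer example above; they are not given in the text.

```python
# Assumed supports (fraction of transactions) and hierarchy levels.
support = {"computer": 0.10, "laptop computer": 0.06, "desktop computer": 0.04}
level = {"computer": 1, "laptop computer": 2, "desktop computer": 2}

def frequent_items(min_support_by_level):
    """Keep each item whose support meets the threshold of its own level."""
    return {
        item for item, s in support.items()
        if s >= min_support_by_level[level[item]]
    }

uniform = frequent_items({1: 0.05, 2: 0.05})  # same 5% threshold at every level
reduced = frequent_items({1: 0.05, 2: 0.03})  # smaller threshold at level 2

print(sorted(uniform))  # ['computer', 'laptop computer']
print(sorted(reduced))  # ['computer', 'desktop computer', 'laptop computer']
```

Under uniform support "desktop computer" is missed, while the reduced threshold at the lower
level recovers it.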

5a) Explain the purpose of Apriori algorithm.

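Ans: The Apriori algorithm is used to identify the frequent itemsets in a data set and to
generate association rules from those itemsets. It prunes the search using the Apriori
property: all nonempty subsets of a frequent itemset must also be frequent.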
5b) Give a note on closed frequent itemsets.
Ans: Closed Frequent Itemset: An itemset is closed if none of its immediate supersets has
the same support as the itemset. A closed frequent itemset is a closed itemset that also
satisfies minimum support.
Ex: {A,B}=3, {A,C}=3, {A}=4. {A,B} and {A,C} are the immediate supersets of {A}, and both
have a smaller support count than {A}. Hence {A} is a closed itemset.
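A minimal check of the closed-itemset condition, assuming the support counts are stored in a
dictionary. The entry {B}=3 is a hypothetical addition to the example, included to show an
itemset that is not closed; the function name closed_itemsets is illustrative.

```python
def closed_itemsets(support_counts):
    """Return the itemsets with no proper superset of equal support."""
    return {
        s for s, n in support_counts.items()
        if not any(s < other and n == m for other, m in support_counts.items())
    }

support_counts = {
    frozenset({"A"}): 4,
    frozenset({"A", "B"}): 3,
    frozenset({"A", "C"}): 3,
    frozenset({"B"}): 3,  # hypothetical: same support as its superset {A, B}
}

closed = closed_itemsets(support_counts)
print(frozenset({"A"}) in closed)  # True
print(frozenset({"B"}) in closed)  # False
```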

5c) Describe various types of association rules.


Ans: An association rule that involves a single predicate (here, the predicate buys) is called
a single-dimensional or intradimensional association rule:
buys(X, “digital camera”) ⇒ buys(X, “HP printer”)
• It contains a single distinct predicate (e.g., buys) with multiple occurrences (i.e., the
predicate occurs more than once within the rule).
• Association rules that involve two or more dimensions or predicates are referred to
as multidimensional association rules:
age(X, “20...29”) ∧ occupation(X, “student”) ⇒ buys(X, “laptop”)
• This rule contains three predicates (age, occupation, and buys), each of which occurs
only once in the rule. Hence, we say that it has no repeated predicates.
• Multidimensional association rules with no repeated predicates are called
interdimensional association rules.
• Multidimensional association rules with repeated predicates, which contain multiple
occurrences of some predicates, are called hybrid-dimensional association rules.
Example: age(X, “20...29”) ∧ buys(X, “laptop”) ⇒ buys(X, “HP printer”)
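The three rule types can be told apart by counting predicates. The sketch below assumes each
rule is represented as two lists of (predicate, value) pairs; this representation and the
function name rule_kind are illustrative only.

```python
def rule_kind(antecedent, consequent):
    """Classify a rule by its distinct and repeated predicates."""
    predicates = [pred for pred, _ in antecedent + consequent]
    distinct = set(predicates)
    if len(distinct) == 1:
        return "single-dimensional"
    if len(distinct) == len(predicates):
        return "interdimensional"   # several predicates, none repeated
    return "hybrid-dimensional"     # several predicates, some repeated

print(rule_kind([("buys", "digital camera")], [("buys", "HP printer")]))
print(rule_kind([("age", "20...29"), ("occupation", "student")], [("buys", "laptop")]))
print(rule_kind([("age", "20...29"), ("buys", "laptop")], [("buys", "HP printer")]))
```

The three calls classify the three example rules above as single-dimensional,
interdimensional, and hybrid-dimensional, respectively.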
5d) Explain constraint-based association mining.
Ans: Users often have a good sense of which “direction” of mining may lead to interesting patterns
and the “form” of the patterns or rules they want to find.
• They may also have a sense of “conditions” for the rules, which would eliminate the
discovery of certain rules that they know would not be of interest.
• Thus, a good heuristic is to have the users specify such intuition or expectations as
constraints to confine the search space.
• This strategy is known as constraint-based mining.
The constraints can include the following:
• Knowledge type constraints: These specify the type of knowledge to be mined, such as
association, correlation, classification, or clustering.
• Data constraints: These specify the set of task-relevant data.
• Dimension/level constraints: These specify the desired dimensions (or attributes) of the
data, the abstraction levels, or the level of the concept hierarchies to be used in mining.
• Interestingness constraints: These specify thresholds on statistical measures of rule
interestingness such as support, confidence, and correlation.
• Rule constraints: These specify the form of, or conditions on, the rules to be mined.
Such constraints may be expressed as metarules (rule templates), as the maximum or
minimum number of predicates that can occur in the rule antecedent or consequent, or
as relationships among attributes, attribute values, and/or aggregates.
6a) Quote an example for quantitative association rule.
Ans: Quantitative association rules are multidimensional association rules in which the
numeric attributes are dynamically discretized.
age(X, “30...39”)∧income(X, “42K...48K”)⇒buys(X, “HDTV”)
6b) Write the FP-growth algorithm.
Ans:
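A minimal Python sketch of FP-growth, assuming support is given as an absolute count and using
the six transactions from question 4c as sample input (30% of 6 gives a minimum count of 2).
The names Node, build_tree, and fp_growth are illustrative, not from a library, and conditional
pattern bases are passed as (items, weight) pairs, which is a simplification of the textbook
layout.

```python
from collections import defaultdict

class Node:
    """One FP-tree node: an item, its count, and links to parent and children."""
    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_tree(weighted_transactions, min_count):
    """Build an FP-tree; return (header table, frequent item counts)."""
    counts = defaultdict(int)
    for items, weight in weighted_transactions:
        for item in set(items):
            counts[item] += weight
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    root = Node(None, None)
    header = defaultdict(list)  # item -> all tree nodes holding that item
    for items, weight in weighted_transactions:
        # Keep frequent items only, in descending order of support.
        ordered = sorted((i for i in set(items) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += weight
            node = child
    return header, frequent

def fp_growth(weighted_transactions, min_count, suffix=frozenset()):
    """Return {itemset: support_count} for all frequent itemsets."""
    header, frequent = build_tree(weighted_transactions, min_count)
    result = {}
    for item, count in frequent.items():
        pattern = suffix | {item}
        result[pattern] = count
        # Conditional pattern base: the prefix path of every node for item.
        base = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        # Mine the conditional FP-tree recursively.
        result.update(fp_growth(base, min_count, pattern))
    return result

data = [
    ["I1", "I2", "I4", "I5"], ["I2", "I4", "I7"], ["I2", "I3", "I4", "I5"],
    ["I1", "I3", "I4", "I7"], ["I1", "I2", "I3", "I4", "I5"], ["I3", "I4", "I5", "I6"],
]
patterns = fp_growth([(t, 1) for t in data], min_count=2)
print(patterns[frozenset({"I2", "I4", "I5"})])  # 3
```

Unlike Apriori, no candidate itemsets are generated: each recursive call works only on the
prefix paths that end in the current item.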

6c) Explain the identification of subgraphs in a graph.


Ans: Graph mining
Graph mining is the process of applying mining techniques to find patterns or relationships
in a given real-world collection of graphs. By mining the graphs, frequent substructures and
relationships can be identified, which helps in clustering the graph sets, finding
relationships between graph sets, or discriminating or characterizing graphs. Identifying
these patterns can help in building models that enhance applications used in real time. To
implement graph mining, one must learn to mine frequent subgraphs.
Frequent Subgraph Mining

Consider a graph h with an edge set E(h) and a vertex set V(h). A subgraph isomorphism from h
to h’ exists when h is a subgraph of h’. A label function is a function that maps either the
edges or the vertices to a label. Consider a labeled graph data set F = {H1, H2, H3, ..., Hn},
and let s(h) denote the support of h, that is, the percentage of graphs in F in which h is a
subgraph. A frequent graph is one whose support is no less than the minimum support
threshold, denoted min_support.
Steps in finding frequent subgraphs:
There are two steps in finding frequent subgraphs.
• The first step is to create frequent substructure candidates.
• The second step is to find the support of each candidate. We must optimize the first
step, because the second step requires subgraph isomorphism testing, which is
NP-complete and therefore computationally expensive.

6d) Explain sequential pattern mining (SPM).

Ans: Sequential pattern mining is the mining of frequently occurring series of events or
subsequences as patterns. An example of a sequential pattern is that users who purchase a Canon
digital camera are likely to purchase an HP color printer within a month.

For retail information, sequential patterns are beneficial for shelf placement and promotions.
This industry, and telecommunications and different businesses, can also use sequential
patterns for targeted marketing, user retention, and several tasks.

There are several areas in which sequential patterns can be used such as Web access pattern
analysis, weather prediction, production processes, and web intrusion detection.

Given a set of sequences, where each sequence consists of a list of events (or elements) and each
event consists of a set of items, and given a user-specified minimum support threshold min_sup,
sequential pattern mining discovers all frequent subsequences, i.e., the subsequences whose
occurrence frequency in the set of sequences is no less than min_sup.

Methods for Sequential Pattern Mining:

• Apriori-based approaches
   • GSP
   • SPADE
• Pattern-growth-based approaches
   • FreeSpan
   • PrefixSpan
Sequence Database: A database that consists of ordered elements or events is called a
sequence database. An example of a sequence database:

S.No.  SID  Sequence
1.     100  <a(ab)(ac)d(cef)> or <a{ab}{ac}d{cef}>
2.     200  <(ad)c(bcd)(abe)>
3.     300  <(ef)(ab)(def)cb>
4.     400  <eg(adf)cbc>

<a(ab)(ac)d(cef)> is a sequence, whereas (a), (ab), (ac), (d), and (cef) are the elements of
the sequence.
These elements are sometimes referred to as transactions.
An element may contain a set of items. Items within an element are unordered, and we list
them alphabetically.
For example, (cef) is an element that consists of 3 items: c, e, and f.
Since all three items belong to the same element, their order does not matter, but we prefer
to put them in alphabetical order for convenience.
The order of the elements in a sequence matters, unlike the order of items within the same
element.
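The support of a candidate subsequence can be counted directly from these definitions. The
sketch below encodes the example sequence database above, with every item written in
lowercase, as lists of item sets; the function names is_subsequence and support_count are
illustrative. It only counts support and does not implement GSP or PrefixSpan.

```python
# Each sequence is an ordered list of elements; each element is a set of items.
database = {
    100: [{"a"}, {"a", "b"}, {"a", "c"}, {"d"}, {"c", "e", "f"}],
    200: [{"a", "d"}, {"c"}, {"b", "c", "d"}, {"a", "b", "e"}],
    300: [{"e", "f"}, {"a", "b"}, {"d", "e", "f"}, {"c"}, {"b"}],
    400: [{"e"}, {"g"}, {"a", "d", "f"}, {"c"}, {"b"}, {"c"}],
}

def is_subsequence(pattern, sequence):
    """True if each pattern element is contained, in order, in a later element."""
    pos = 0
    for element in pattern:
        # Advance to the first remaining element that contains this item set.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match strictly later
    return True

def support_count(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(1 for seq in database.values() if is_subsequence(pattern, seq))

print(support_count([{"a"}, {"c"}], database))       # 4
print(support_count([{"a", "b"}, {"c"}], database))  # 2
```

The pattern <ac> occurs in all four sequences, while <(ab)c> occurs only in sequences 100 and
300, so with min_sup = 2 both are frequent subsequences.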
