Frequent Patterns
Bioinformatics
In bioinformatics, frequent pattern mining can be used to identify common patterns in
DNA sequences, protein structures, or gene expressions, leading to insights in genetics
and drug design.
Samarth
-Market basket analysis is a modelling technique, also called affinity analysis, that helps identify which items are likely to be purchased together.
-In simple terms, market basket analysis in data mining analyzes the combinations of products that have been bought together.
-This type of analysis helps retailers develop different strategies for their business.
-Business analysts decide strategies around the items customers frequently purchase together.
e.g. In the winter season many customers purchase woollen clothes and body moisturizing creams together. So a business analyst suggests to the owner of a supermarket or shopping mall such as D'Mart or Big Bazaar: give a discount to customers who purchase both items together.
So if customers buy bread and butter and see a discount or an offer on eggs, they will be encouraged to spend more and buy the eggs. This is what market basket analysis is all about.
Market basket analysis mainly works with the ASSOCIATION RULE {IF} -> {THEN}.
For example, IF a customer buys bread, THEN he is likely to buy butter as well.
Association rules are usually represented as: {Bread} -> {Butter}
● Retail: The most well-known MBA case study is Amazon.com. Whenever you view a product on Amazon, the product page automatically recommends items frequently bought together. It is perhaps the simplest and cleanest example of MBA's cross-selling techniques.
Apart from e-commerce, MBA is also widely applicable to the in-store retail segment. Grocery stores pay meticulous attention to product placement and shelving optimization. For example, you are almost always likely to find shampoo and conditioner placed very close to each other at the grocery store. Walmart's famous beer-and-diapers association anecdote is also an example of market basket analysis.
Frequent Itemsets
-Frequent itemsets are sets of items that frequently appear together in transactional data.
-Finding frequent patterns, associations, correlations, or causal structures among sets
of items or objects in transaction databases, relational databases, and other information
repositories.
-Several algorithms have been used for generating frequent itemsets and rules, such as the Apriori algorithm and the FP-Growth algorithm.
-The number of transactions that contain an itemset is known simply as the frequency, support count, or count of the itemset.
-An itemset X is frequent if X's support is no less than a minimum support threshold
-A frequent itemset is a set of items that appears at least in a pre-specified number of
transactions.
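The definitions above can be sketched in a few lines of Python. The transaction data and the threshold below are made-up illustrative values, not data from these notes.

```python
# Minimal sketch: counting the support of an itemset in a transaction database.

def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    s = set(itemset)
    return sum(1 for t in transactions if s.issubset(t))

# Hypothetical transaction database.
transactions = [
    {"bread", "butter", "eggs"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
]

min_support_count = 2  # the pre-specified number of transactions

count = support_count({"bread", "butter"}, transactions)
print(count)                       # 2
print(count >= min_support_count)  # True -> {bread, butter} is frequent
```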
-The lattice diagram above shows the maximal, closed and frequent itemsets.
-The itemsets that are circled with blue are the frequent itemsets.
-The itemsets that are circled with the thick blue are the closed frequent itemsets.
-The itemsets that are circled with the thick blue and have the yellow fill are the maximal
frequent itemsets.
- In order to determine which of the frequent itemsets are closed, all you have to do is check whether they have the same support as any of their supersets; if they do, they are not closed.
-For example, ad is a frequent itemset but has the same support as abd, so it is NOT a closed frequent itemset; c, on the other hand, is a closed frequent itemset because all of its supersets, ac, bc, and cd, have supports less than 3.
As you can see there are a total of 9 frequent itemsets, 4 of them are closed frequent
itemsets and out of these 4, 2 of them are maximal frequent itemsets. This brings us to
the relationship between the three representations of frequent itemsets.
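The closed/maximal check described above can be sketched directly in code. The support counts below are hypothetical stand-ins for the lattice example (e.g. "ad" is given the same support as its superset "abd", so it is not closed).

```python
# Sketch: classifying frequent itemsets as closed or maximal from their supports.

def is_closed(itemset, supports):
    """Closed: no proper superset has the same support."""
    return not any(set(itemset) < set(sup) and supports[sup] == supports[itemset]
                   for sup in supports)

def is_maximal(itemset, supports):
    """Maximal: no proper superset is frequent at all."""
    return not any(set(itemset) < set(sup) for sup in supports)

# Hypothetical frequent itemsets and their support counts.
supports = {
    frozenset("c"): 3,
    frozenset("ac"): 2,
    frozenset("ad"): 2,
    frozenset("abd"): 2,
}

print(is_closed(frozenset("ad"), supports))    # False: abd has the same support
print(is_closed(frozenset("c"), supports))     # True: its superset ac has smaller support
print(is_maximal(frozenset("abd"), supports))  # True: no frequent superset exists
```

Note that every maximal itemset is closed, but not vice versa, which is the relationship the text summarizes.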
Association Rules
Association rule learning is an unsupervised learning technique that checks for the dependency of one data item on another and maps them accordingly, so that more profitable patterns can be found.
Association rule learning works on the concept of if-then statements, such as: if A, then B.
SUPPORT: Support is calculated as the number of transactions containing the itemset divided by the total number of transactions:
Support = freq(A,B)/N
i.e. support -> 100/1000 = 10 percent
CONFIDENCE: Confidence measures how often the products are sold together relative to how often the first product is sold on its own. It is calculated as the combined transactions divided by the individual transactions:
Confidence = freq(A,B)/freq(A)
Confidence = combined transactions/individual transactions
i.e. confidence -> 100/500 = 20 percent
LIFT: Lift measures how much more likely B is to be bought when A is bought, compared with B's baseline popularity. It is the ratio of the rule's confidence to the support of B:
Lift = confidence percent / support percent
Lift -> 20/10 = 2
A lift value below 1 means the combination is not frequently bought together by consumers. In this case, the lift of 2 shows that the probability of buying both items together is high compared with the transactions for the individual items sold.
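The three metrics can be computed in a few lines. The counts below are made-up values chosen to reproduce the figures in these notes: 1000 transactions in total, A in 500, B in 100, and A and B together in 100.

```python
# Support, confidence, and lift for the rule A -> B, from raw counts.

N = 1000        # total number of transactions
freq_A = 500    # transactions containing A
freq_B = 100    # transactions containing B
freq_AB = 100   # transactions containing both A and B

support_AB = freq_AB / N           # 100/1000 = 0.10 -> 10 percent
confidence = freq_AB / freq_A      # 100/500  = 0.20 -> 20 percent
lift = confidence / (freq_B / N)   # 0.20/0.10 = 2.0

print(support_AB, confidence, lift)
```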
Apriori Algorithm
-R. Agrawal and R. Srikant are the creators of the Apriori algorithm. They proposed it in 1994 for mining frequent itemsets for Boolean association rules.
-The algorithm has found great use in performing Market Basket Analysis, allowing
businesses to sell their products more effectively.
-This algorithm uses 2 steps, “Join” and “Prune”, to reduce the search space.
-The use of this algorithm is not just for market basket analysis. Various fields, like
healthcare, education, etc, also use it.
-It uses knowledge of frequent itemset properties.
-Its key principle is the Apriori property: “All subsets of a frequent itemset must be frequent.”
-It searches for itemsets level-wise (breadth-first).
Ex. Consider the given database D and a minimum support of 50% (i.e. a support count of 2 out of 4 transactions). Apply the Apriori algorithm and find the frequent itemsets.
TID Items
1 1,3,4
2 2,3,5
3 1,2,3,5
4 2,5
Step 1: Create candidate itemset C1 containing every single item: {1}, {2}, {3}, {4}, {5}
Step 2: Scan D for the count of each candidate and find the support.
C1=
Itemset Support Count
1 2
2 3
3 3
4 1
5 3
Step 3: Compare each candidate's support count with min_sup (i.e. 2) and prune the itemsets whose support count is less than min_sup.
L1=
Itemset Support Count
1 2
2 3
3 3
5 3
Step 4: Generate candidate set C2 by joining L1 with itself.
C2 Itemsets
1,2
1,3
1,5
2,3
2,5
3,5
Step 5: Scan D for count of each candidate and find the support.
C2=
Itemset Support Count
1,2 1
1,3 2
1,5 1
2,3 2
2,5 3
3,5 2
Step 6: Compare each candidate's support count with min_sup (i.e. 2) and prune the itemsets whose support count is less than min_sup.
L2=
Itemset Support Count
1,3 2
2,3 2
2,5 3
3,5 2
Step 7: Generate candidate set C3 by joining L2 with itself.
C3 Itemsets
1,2,3
1,2,5
1,3,5
2,3,5
Step 8: Scan D for count of each candidate and find the support.
C3=
Itemset Support Count
1,2,3 1
1,2,5 1
1,3,5 1
2,3,5 2
Step 9: Compare each candidate's support count with min_sup (i.e. 2) and prune the itemsets whose support count is less than min_sup.
L3=
Itemsets Support Count
2,3,5 2
Create all nonempty proper subsets of {2,3,5}: {2,3}, {2,5}, {3,5}, {2}, {3}, {5}
Rule 1= 2 & 3 → 5
confidence=Support(2,3,5)/Support(2,3)=2/2*100=100%
Rule 2= 2 & 5 →3
confidence=Support(2,5,3)/Support(2,5)=2/3*100=66.66%
Rule 3= 3 & 5 →2
confidence=Support(3,5,2)/Support(3,5)=2/2*100=100%
Rule 4= 2 → 3 & 5
confidence=Support(2,3,5)/Support(2)=2/3*100=66.66%
Rule 5= 3 → 2 & 5
confidence=Support(3,2,5)/Support(3)=2/3*100=66.66%
Rule 6= 5 → 2 & 3
confidence=Support(5,2,3)/Support(5)=2/3*100=66.66%
As the given minimum confidence threshold is 60%, all the above rules are selected as the final output.
If a minimum confidence threshold of 70% were given, then only 2 rules would be output:
Rule 1 = 2 & 3 → 5
Rule 3 = 3 & 5 → 2
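The whole level-wise run above can be reproduced in a short script. Database D, the minimum support count of 2, and the rules over {2,3,5} are taken from the worked example; the candidate join here is a naive pairwise union rather than the prefix-based join, which is sufficient for this small sketch.

```python
from itertools import combinations

# Database D from the worked example, one set per transaction.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
MIN_SUP = 2  # 50% of 4 transactions

def support(itemset):
    """Number of transactions in D containing every item of `itemset`."""
    return sum(1 for t in D if itemset <= t)

items = sorted({i for t in D for i in t})
frequent = {}                                  # itemset -> support count
level = [frozenset([i]) for i in items]        # C1: all single items
while level:
    # prune step: drop candidates below the minimum support count
    level = [c for c in level if support(c) >= MIN_SUP]
    frequent.update({c: support(c) for c in level})
    # join step (naive): unions of surviving itemsets that are one item larger
    level = sorted({a | b for a in level for b in level
                    if len(a | b) == len(a) + 1}, key=sorted)

print(max(frequent, key=len))                  # largest frequent itemset: {2, 3, 5}

# Rule generation from the largest frequent itemset, as in Rules 1-6 above.
triple = frozenset({2, 3, 5})
for r in (2, 1):
    for ante in combinations(sorted(triple), r):
        conf = support(triple) / support(frozenset(ante)) * 100
        print(set(ante), "->", set(triple) - set(ante), round(conf, 2), "%")
```

The script finds the same 9 frequent itemsets (4 in L1, 4 in L2, 1 in L3) and the same confidences of 100% and 66.67% as the manual trace.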
Cluster Analysis:
Requirements of clustering in data mining:
Scalability: Need highly scalable clustering algorithms to deal with large databases.
Discovery of clusters with arbitrary shape: The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be limited to distance measures that tend to find only small spherical clusters.
High dimensionality: The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional space.
Ability to deal with noisy data: Databases contain noisy, missing or erroneous data.
Some algorithms are sensitive to such data and may lead to poor quality clusters.
Interpretability: The clustering results should be interpretable, comprehensible, and
usable.
1. Partitioning methods: The set of data objects is partitioned into k groups, each of which represents a cluster.
2. Hierarchical methods: The set of data objects is decomposed hierarchically using certain criteria.
3. Density-based methods: This approach is based on density (local cluster criteria), e.g. density-connected points.
4. Grid-based methods: This approach is based on a multi-resolution grid data structure.
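As a concrete illustration of the first family, a partitioning method such as k-means can be sketched in a few lines. The 2-D points, k=2, and the iteration count below are illustrative assumptions, not values from these notes.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Naive k-means: alternate assignment and center-update steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # pick k initial centers
    clusters = []
    for _ in range(iters):
        # assignment step: each point joins the cluster of its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[i].append(p)
        # update step: move each center to the mean of its cluster's points
        centers = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

# Two visually separated groups of three points each.
points = [(1, 1), (1.5, 2), (1, 0), (8, 8), (9, 9), (8, 9)]
centers, clusters = kmeans(points, k=2)
print(sorted(len(cl) for cl in clusters))      # the two natural groups of 3
```

Hierarchical, density-based, and grid-based methods pursue the same goal of grouping similar objects, but with merging/splitting, local density, and grid-cell criteria respectively.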
Jiawei Han and Kamber have given an overview of the above-mentioned clustering methods in a table listing each method and its general characteristics.
Marketing: Clustering can be used for targeted marketing. For example, given a customer database containing customer properties and past buying records, similar customers can be identified and grouped into one cluster.
Biology: Clustering can also be used in classifying plants and animals into different
classes based on their features.
Libraries: Based on different details about books, clustering can be used for book
ordering.
Insurance: With the help of clustering, different groups of policyholders can be identified, for example policyholders with a high average claim cost, or possible frauds.
City-Planning: Using details like house type, geographical locations, groups of houses
can be identified using clustering.
Earthquake Studies: Clustering can also be used to identify dangerous zones based on
earthquake epicenters.
WWW: Clustering can be used to find groups of similar access patterns using weblog
data. It can also be used for classification of documents.