5. Mining Frequent Patterns and Cluster Analysis

Frequent Patterns

-Frequent patterns are patterns that appear frequently in a dataset.


-This is typically done by analyzing large datasets to find items or sets of items that
appear together frequently.
-Frequent patterns are useful for recommendation systems. Suppose a customer buys a camera; how can we recommend other products that suit a camera, such as lenses, binoculars, a camera stand, or memory cards? Frequent patterns can be used for this kind of recommendation.
-If many customers frequently buy a camera together with a memory card, we can use this relationship to make recommendations to new customers who wish to buy a camera.
-Finding frequent patterns from a data set is useful in data classification, clustering, and other data mining operations.
-Frequent pattern mining searches for recurring relationships in a stored dataset.
-Frequent pattern mining is a data mining task whose objective is to extract frequent itemsets from a database.
-Frequent item sets play an essential role in many Data Mining tasks and are related to
interesting patterns in data, such as Association Rules.

Applications of Frequent Pattern Mining

Market Basket Analysis


Market basket analysis involves analyzing customer purchase patterns to identify
connections between items and enhance sales strategies.

Web usage mining


Web usage mining examines user navigation patterns to learn how people use websites. Frequent pattern mining makes it possible to identify recurrent navigation patterns and session patterns, which can be used to personalize websites and improve their performance.

Bioinformatics
In bioinformatics, frequent pattern mining can be used to identify common patterns in
DNA sequences, protein structures, or gene expressions, leading to insights in genetics
and drug design.

Market Basket Analysis

-Market basket analysis, also called affinity analysis, is a modelling technique that helps identify which items are likely to be purchased together.
-In simple terms, market basket analysis in data mining analyzes the combinations of products that have been bought together.
-This type of analysis helps retailers develop different strategies for their business.
-Business analysts decide strategies around the items that customers frequently purchase together.

e.g. In the winter season many customers purchase woollen clothes and body moisturizing creams together. So a business analyst may suggest to the owner of a supermarket or shopping mall such as D'Mart or Big Bazaar: give a discount to customers who purchase both items together.
Likewise, if customers who buy bread and butter see a discount or an offer on eggs, they will be encouraged to spend more and buy the eggs. This is what market basket analysis is all about.

Market basket analysis mainly works with the ASSOCIATION RULE {IF} -> {THEN}.

IF means Antecedent: An antecedent is an item found within the data


THEN means Consequent: A consequent is an item found in combination with the
antecedent.

For example, IF a customer buys bread, THEN he is likely to buy butter as well.
Association rules are usually represented as: {Bread} -> {Butter}

Applications of Market Basket Analysis



● Retail: The best-known MBA case study is Amazon.com. Whenever you
view a product on Amazon, the product page automatically recommends "Items
frequently bought together." It is perhaps the simplest and cleanest example of
MBA-based cross-selling.
Apart from e-commerce formats, MBA is also widely applicable to the in-store retail
segment. Grocery stores pay meticulous attention to product placement
and shelving optimization. For example, you are almost always likely to find
shampoo and conditioner placed very close to each other at the grocery store.
Walmart's famous beer-and-diapers association anecdote is also an example of
market basket analysis.

● Finance/Criminology: Market basket analysis finds great application in detecting
fraud in credit card usage data.
● Manufacturing: Predictive market basket analysis helps in predicting
equipment failure.
● Bioinformatics/Pharmaceutical: Market basket analysis helps in discovering
co-occurrence relationships among pharmaceutically active ingredients and the
diagnoses prescribed to different groups of patients.

● Customer behavior: Using socio-economic and demographic data, market basket
analysis helps in determining associated purchases.
● Medicine: Market basket analysis finds great use in symptom analysis and in
determining comorbid conditions in the medical field. It also helps in identifying
hereditary traits and genes that are associated with local environmental
effects.
● Telecom: Due to the intense competition in the telecom sector, businesses pay
close attention to the services that customers frequently use together. For
instance, telecom providers have started to combine TV and Internet bundles with
other affordable internet offerings to reduce customer migration.
● IBFS: Tracing credit card history is a hugely advantageous market
opportunity for IBFS organizations. For example, Citibank frequently employs
sales personnel at large malls to attract potential customers with on-the-spot
discounts. They also partner with apps like Swiggy and Zomato to
show customers offers they can avail of by purchasing through credit
cards. IBFS organizations also use basket analysis to detect fraudulent
claims.

Frequent Itemsets

-Frequent itemsets are itemsets that appear together frequently in transactional data.
-Finding frequent patterns, associations, correlations, or causal structures among sets
of items or objects in transaction databases, relational databases, and other information
repositories.
-Several algorithms have been used to generate frequent itemsets and the rules derived
from them, such as the Apriori algorithm and the FP-Growth algorithm.

-A set of items is referred to as an itemset.


-An itemset that contains k items is a k-itemset.

-The set {computer, antivirus software} is a 2-itemset.

-The occurrence frequency of an itemset is the number of transactions that contain the itemset.

-This is also known, simply, as the frequency, support count, or count of the itemset.
-An itemset X is frequent if X's support is no less than a minimum support threshold.
-A frequent itemset is a set of items that appears in at least a pre-specified number of
transactions.

Frequent itemsets are typically used to generate association rules.

The procedure to find frequent itemsets:

1. A level-wise search is conducted: first find the frequent 1-itemsets (sets of size 1), then the frequent 2-itemsets, and so on.
2. Next, search for all maximal frequent itemsets.
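
As a concrete illustration, here is a minimal Python sketch of support counting (the transactions below are hypothetical, made up for this example):

def support_count(itemset, transactions):
    # Number of transactions that contain every item of the itemset.
    s = set(itemset)
    return sum(1 for t in transactions if s.issubset(t))

transactions = [
    {"computer", "antivirus software"},
    {"computer", "printer"},
    {"computer", "antivirus software", "printer"},
]

# The 2-itemset {computer, antivirus software} appears in 2 transactions here.
print(support_count({"computer", "antivirus software"}, transactions))  # 2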

Closed Itemsets and Maximal Itemsets

An itemset is maximal frequent if none of its immediate supersets is frequent. An itemset is closed if none of its immediate supersets has the same support as the itemset.
For example, identification proceeds as follows:
1. First identify all frequent itemsets.
2. Then, from this group, find those that are closed by checking whether there exists
a superset with the same support as the frequent itemset; if there is, the
itemset is disqualified, but if none can be found, the itemset is closed.
An alternative method is to first identify the closed itemsets and then use the
minsup threshold to determine which ones are frequent.

-The lattice diagram above shows the maximal, closed and frequent itemsets.
-The itemsets that are circled with blue are the frequent itemsets.
-The itemsets that are circled with the thick blue are the closed frequent itemsets.
-The itemsets that are circled with the thick blue and have the yellow fill are the maximal
frequent itemsets.
-In order to determine which of the frequent itemsets are closed, all you have to do is
check whether they have the same support as their supersets; if they do, they are not
closed.
-For example, {a,d} is a frequent itemset but has the same support as {a,b,d}, so it is
NOT a closed frequent itemset; {c}, on the other hand, is a closed frequent itemset
because all of its supersets {a,c}, {b,c}, and {c,d} have supports less than 3.
As you can see, there are a total of 9 frequent itemsets; 4 of them are closed frequent
itemsets, and of these 4, 2 are maximal frequent itemsets. This brings us to
the relationship between the three representations of frequent itemsets.
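
The identification procedure above can be written as a short Python sketch. This is only an illustration: the support values below are hypothetical stand-ins for the lattice example, and only frequent supersets need to be checked (a superset with the same support as a frequent itemset is itself frequent):

# frequent itemset -> support count (hypothetical values)
frequent = {
    frozenset({"c"}): 3,
    frozenset({"a", "d"}): 3,
    frozenset({"a", "b", "d"}): 3,
}

def is_closed(itemset):
    # Closed: no frequent proper superset has the same support.
    return all(not (itemset < other and frequent[other] == frequent[itemset])
               for other in frequent)

def is_maximal(itemset):
    # Maximal: no frequent proper superset exists at all.
    return all(not itemset < other for other in frequent)

for s in frequent:
    print(sorted(s), "closed:", is_closed(s), "maximal:", is_maximal(s))
# {a,d} comes out not closed: {a,b,d} has the same support (3).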

Association Rules

Association rule learning is an unsupervised learning technique that checks for the
dependency of one data item on another data item and maps these dependencies so that
they can be exploited profitably.
Association rule learning works on the concept of if-then statements, such as: if A,
then B.

IF means Antecedent: An antecedent is an item found within the data


THEN means Consequent: A consequent is an item found in combination with the
antecedent.
There are three ways to measure association:
SUPPORT
CONFIDENCE
LIFT
Now take an example. Suppose 5,000 transactions have been made through a popular
eCommerce website, and we want to calculate the support, confidence, and lift for
two products, say a pen and a notebook. Out of the 5,000 transactions, 1,000 contain
a pen, 1,000 contain a notebook, and 400 contain both.

SUPPORT: the fraction of all transactions that contain both items.

Support = freq(A,B)/N

support(pen → notebook) = 400/5000 = 8 percent

CONFIDENCE: how often the consequent is bought when the antecedent is bought. It is
calculated as the combined transactions divided by the antecedent's individual
transactions.

Confidence = freq(A,B)/freq(A)

confidence(pen → notebook) = 400/1000 = 40 percent

LIFT: the ratio of the observed confidence to the consequent's baseline support, i.e.
how much more likely the two items are to be bought together than if they were bought
independently.

Lift = Confidence(A → B)/Support(B)

lift = 40%/20% = 2

A lift value below 1 means the combination is not frequently bought together by
consumers. Here the lift is 2, which shows that the probability of buying both items
together is high compared to the individual items sold alone.
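
These three measures are simple to compute directly. A minimal Python sketch using the pen/notebook numbers above:

N = 5000          # total transactions
pen = 1000        # transactions containing a pen
notebook = 1000   # transactions containing a notebook
both = 400        # transactions containing both

support = both / N                  # 400/5000 = 0.08 -> 8%
confidence = both / pen             # 400/1000 = 0.40 -> 40%
lift = confidence / (notebook / N)  # 0.40 / 0.20 = 2.0

print(f"support={support:.0%} confidence={confidence:.0%} lift={lift}")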

Apriori Algorithm

-R. Agrawal and R. Srikant are the creators of the Apriori algorithm. They proposed it in
1994 for identifying the most frequent itemsets through Boolean association rules.
-The algorithm has found great use in performing Market Basket Analysis, allowing
businesses to sell their products more effectively.
-This algorithm uses 2 steps, "join" and "prune", to reduce the search space.
-The use of this algorithm is not limited to market basket analysis. Various fields, like
healthcare, education, etc., also use it.
-It uses knowledge of frequent itemset properties.
-The key property this algorithm relies on is: "All subsets of a frequent itemset must be
frequent."
-It searches for itemsets level-wise.

How does the Apriori algorithm work?


● Step 1 – Find frequent items:
○ It starts by identifying individual items (like products in a store) that appear
frequently in the dataset.
● Step 2 – Generate candidate itemsets:
○ Then, it combines these frequent items to create sets of two or more
items. These sets are called "itemsets".
● Step 3 – Count support for candidate itemsets:
○ Next, it counts how often each of these itemsets appears in the dataset.
● Step 4 – Eliminate infrequent itemsets:
○ It removes itemsets that don’t meet a certain threshold of frequency,
known as the “support threshold”. This threshold is set by the user.
● Repeat Steps 2-4:
○ The process is repeated, creating larger and larger itemsets, until no more
can be made.
● Find associations:
○ Finally, Apriori uses the frequent itemsets to find associations. For
example, if “bread” and “milk” are often bought together, it will identify this
as an association.

Apriori Algorithm for finding frequent itemsets using candidate generation

To convert the minimum support percentage into a count, we use: min_supp count = (given minimum support × number of transactions) / 100.

Ex. Consider the given database D and minimum support 50%. Apply the Apriori
algorithm and find the frequent itemsets.

TID  Items
1    1, 3, 4
2    2, 3, 5
3    1, 2, 3, 5
4    2, 5

Solution: Calculate min_supp = 0.5 × 4 = 2 (the support count is 2).

(0.5: given minimum support in the problem; 4: total transactions in database D)

Step 1: Generate candidate list C1 from D

C1 = {1}, {2}, {3}, {4}, {5}

Step 2: Scan D for the count of each candidate and find the support.

C1 =

Itemset  Support Count
1        2
2        3
3        3
4        1
5        3

Step 3: Compare each candidate's support count with min_supp (i.e. 2) and prune
(remove) the itemsets whose support count is less than min_supp.

L1 =

Itemset  Support Count
1        2
2        3
3        3
5        3

Step 4: Generate candidate list C2 from L1

C2 =

Itemsets
1,2
1,3
1,5
2,3
2,5
3,5

Step 5: Scan D for the count of each candidate and find the support.

C2 =

Itemset  Support Count
1,2      1
1,3      2
1,5      1
2,3      2
2,5      3
3,5      2

Step 6: Compare each candidate's support count with min_supp (i.e. 2) and prune the
itemsets whose support count is less than min_supp.

L2 =

Itemset  Support Count
1,3      2
2,3      2
2,5      3
3,5      2

Step 7: Generate candidate list C3 from L2

C3 =

Itemsets
1,2,3
1,2,5
1,3,5
2,3,5

(With the Apriori prune step, {1,2,3}, {1,2,5}, and {1,3,5} could be discarded
immediately, because they contain the infrequent 2-itemsets {1,2} or {1,5}; here they
are carried forward and eliminated by the support scan instead.)

Step 8: Scan D for the count of each candidate and find the support.

C3 =

Itemsets  Support Count
1,2,3     1
1,2,5     1
1,3,5     1
2,3,5     2

Step 9: Compare each candidate's support count with min_supp (i.e. 2) and prune the
itemsets whose support count is less than min_supp.

L3 =

Itemsets  Support Count
2,3,5     2

Here, {2,3,5} is the frequent itemset found using the Apriori method.
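
A compact Python sketch of this level-wise procedure, run on database D above (a simplified illustration of the join and prune steps, not an optimized implementation):

from itertools import combinations

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]  # database D from the example
min_supp = 2  # 50% of 4 transactions

def support(itemset):
    # Number of transactions containing every item of the itemset.
    return sum(1 for t in D if itemset <= t)

# L1: frequent 1-itemsets
items = {i for t in D for i in t}
level = {frozenset([i]) for i in items if support(frozenset([i])) >= min_supp}
k = 2
while level:
    for s in sorted(level, key=sorted):
        print(sorted(s), support(s))
    # Join step: union pairs from the previous level into k-itemsets.
    cands = {a | b for a in level for b in level if len(a | b) == k}
    # Prune step: keep candidates whose (k-1)-subsets are all frequent.
    cands = {c for c in cands
             if all(frozenset(sub) in level for sub in combinations(c, k - 1))}
    level = {c for c in cands if support(c) >= min_supp}
    k += 1
# The last line printed is [2, 3, 5] 2, matching the result above.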


Generating association rules from frequent itemsets

Association rule generation is a two-step process.

-The first step is to generate an itemset, like {Bread, Egg, Milk}; the second is to
generate rules from each itemset, like {Bread → Egg, Milk}, {Bread, Egg → Milk}, etc.
1. Generating itemsets from a list of items
The first step in the generation of association rules is to get all the frequent itemsets,
on which binary partitions can be performed to get the antecedent and the consequent.
2. Generating all possible rules from the frequent itemsets
-Once the frequent itemsets are generated, identifying rules from them is
comparatively less taxing. Rules are formed by binary partition of each itemset. If
{Bread, Egg, Milk, Butter} is the frequent itemset, candidate rules will look like:
(Egg, Milk, Butter → Bread), (Bread, Milk, Butter → Egg), (Bread, Egg → Milk, Butter),
(Egg, Milk → Bread, Butter), (Butter→ Bread, Egg, Milk)

Ex. Consider the above solved example, and assume a minimum confidence value of
60%.
-For every nonempty proper subset S of a frequent itemset I, output the rule
S → (I − S) (meaning S recommends I − S) if Support(I)/Support(S) ≥ min_conf.
Here I = {2,3,5}.

Nonempty proper subsets of {2,3,5}: {2,3}, {2,5}, {3,5}, {2}, {3}, {5}

Rule 1: 2 & 3 → 5
confidence = Support(2,3,5)/Support(2,3) = 2/2 × 100 = 100%

Rule 2: 2 & 5 → 3
confidence = Support(2,5,3)/Support(2,5) = 2/3 × 100 = 66.66%

Rule 3: 3 & 5 → 2
confidence = Support(3,5,2)/Support(3,5) = 2/2 × 100 = 100%

Rule 4: 2 → 3 & 5
confidence = Support(2,3,5)/Support(2) = 2/3 × 100 = 66.66%

Rule 5: 3 → 2 & 5
confidence = Support(3,2,5)/Support(3) = 2/3 × 100 = 66.66%

Rule 6: 5 → 2 & 3
confidence = Support(5,2,3)/Support(5) = 2/3 × 100 = 66.66%

Since the given minimum confidence is 60%, all of the above rules meet the threshold
and are selected as the final output.

If the minimum confidence threshold were 70%, only 2 rules would be output:
1. 2 & 3 → 5
2. 3 & 5 → 2
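
The same rule enumeration can be scripted. A minimal Python sketch using the support counts from the worked Apriori example above:

from itertools import combinations

# Support counts taken from the worked example above.
supp = {
    frozenset([2]): 3, frozenset([3]): 3, frozenset([5]): 3,
    frozenset([2, 3]): 2, frozenset([2, 5]): 3, frozenset([3, 5]): 2,
    frozenset([2, 3, 5]): 2,
}

I = frozenset([2, 3, 5])
min_conf = 0.60

for r in range(1, len(I)):
    for S in map(frozenset, combinations(I, r)):
        conf = supp[I] / supp[S]  # confidence of the rule S -> (I - S)
        if conf >= min_conf:
            print(f"{sorted(S)} -> {sorted(I - S)}: {conf:.2%}")
# Prints all six rules; raising min_conf to 0.70 leaves only the two 100% rules.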

Cluster Analysis:

-A cluster is a group of objects that belong to the same class.


-In other words, similar objects are grouped in one cluster and dissimilar objects are
grouped in another cluster.
-Clustering is a data mining technique used to place data elements into related groups
without advance knowledge.
-Each subset is a cluster, such that objects in a cluster are similar to one another, yet
dissimilar to objects in other clusters.
-The set of clusters resulting from a cluster analysis can be referred to as a clustering.
-Cluster analysis or simply clustering is the process of partitioning a set of data objects
(or observations) into subsets.
-A cluster contains data objects that have high intra-cluster similarity and low
inter-cluster similarity: objects within a cluster are highly similar to one another and
dissimilar to objects in other clusters.

Requirements of Cluster Analysis:

Scalability: We need highly scalable clustering algorithms to deal with large databases.

Ability to deal with different kinds of attributes: Algorithms should be capable of being
applied to any kind of data, such as interval-based (numerical), categorical, and
binary data.

Discovery of clusters with arbitrary shape: The clustering algorithm should be capable of
detecting clusters of arbitrary shape. It should not be bounded to distance
measures that tend to find only spherical clusters of small size.

High dimensionality: The clustering algorithm should be able to handle not only
low-dimensional data but also high-dimensional spaces.

Ability to deal with noisy data: Databases contain noisy, missing or erroneous data.
Some algorithms are sensitive to such data and may produce poor-quality clusters.

Interpretability: The clustering results should be interpretable, comprehensible, and
usable.

General Overview of Clustering Methods

A good clustering method will produce high quality clusters with:


-High Intra-class similarity
-Low inter-class similarity

Major clustering methods can be classified into the following categories:

1. Partitioning methods: In the partitioning-based approach, various partitions are
created and then evaluated based on certain criteria.
2. Hierarchical methods: The set of data objects is decomposed hierarchically using
certain criteria.
3. Density-based methods: This approach is based on density (a local cluster criterion),
e.g. density-connected points.
4. Grid-based methods: This approach is based on a multi-resolution grid data structure.

Jiawei Han and Micheline Kamber give an overview of the above-mentioned clustering
methods, summarized below:
Partitioning methods
-Find mutually exclusive clusters of spherical shape.
-Distance-based.
-May use the mean or medoid (etc.) to represent the cluster center.
-Effective for small to medium-sized data sets.
Common algorithms:
● K-means clustering
● CLARANS (Clustering Large Applications based upon Randomized Search)

Hierarchical methods
-Clustering is a hierarchical decomposition (i.e., multiple levels).
-Cannot correct erroneous merges or splits.
-May incorporate other techniques like micro-clustering or consider object "linkages".
Common algorithms:
● CURE (Clustering Using Representatives)
● BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)

Density-based methods
-Can find arbitrarily shaped clusters.
-Clusters are dense regions of objects in space that are separated by low-density
regions.
-Cluster density: each point must have a minimum number of points within its
"neighbourhood".
-May filter out outliers.
Common algorithms:
● DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
● OPTICS (Ordering Points To Identify the Clustering Structure)

Grid-based methods
-Use a multi-resolution grid data structure.
-Fast processing time (typically independent of the number of data objects, yet
dependent on grid size).
Common algorithms:
● STING (Statistical Information Grid)
● WaveCluster
● CLIQUE (Clustering In QUEst)
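
As an illustration of the partitioning approach, here is a minimal k-means sketch in plain Python (the toy points are hypothetical; a real application would typically use a library such as scikit-learn):

import random

def kmeans(points, k, iters=20):
    # Start from k random data points as initial centers.
    centers = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        for c, members in enumerate(clusters):
            if members:
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centers, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = kmeans(points, k=2)
print(centers)  # typically one center near (1.3, 1.3) and one near (8.3, 8.3)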

General Applications of Clustering

Clustering algorithms can be applied in many disciplines:

Marketing: Clustering can be used for targeted marketing. For example, given a customer
database containing customer properties and past buying records, similar groups of
customers can be identified and grouped into one cluster.
Biology: Clustering can also be used to classify plants and animals into different
classes based on their features.
Libraries: Based on different details about books, clustering can be used for organizing
book collections.
Insurance: With the help of clustering, different groups of policyholders can be
identified, e.g. policyholders with a high average claim cost, or possible frauds.
City planning: Using details like house type and geographical location, groups of houses
can be identified using clustering.
Earthquake Studies: Clustering can also be used to identify dangerous zones based on
earthquake epicenters.
WWW: Clustering can be used to find groups of similar access patterns using weblog
data. It can also be used for classification of documents.
