
Unsupervised Learning

• Association
• Clustering
Association
• Affinity analysis – what goes with what
• Groups of items purchased together (Store layout &
Cross selling)
• If-then rules: one or more items (antecedent) associated with one item (consequent)
• Antecedent and consequent are disjoint; useful rules are non-trivial and interpretable
(Example: buyers of an insurance policy will also buy a car with probability 0.98)
• Application: Recommender systems in e-commerce /
Web mining / Credit card purchase
• Identification of frequent item sets
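The probability attached to an if-then rule can be read as a conditional relative frequency: among the transactions that contain the antecedent, the share that also contain the consequent. The sketch below is one way to compute it. The function name `rule_probability` and the toy transactions are illustrative assumptions, not taken from the slides.

```python
def rule_probability(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as a relative frequency."""
    antecedent, consequent = set(antecedent), set(consequent)
    # The two sides of a rule must not share items (disjoint).
    if antecedent & consequent:
        raise ValueError("antecedent and consequent must be disjoint")
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    if not with_antecedent:
        return 0.0
    with_both = [t for t in with_antecedent if consequent <= set(t)]
    return len(with_both) / len(with_antecedent)


# Hypothetical toy data, invented for illustration only.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]
# 2 of the 3 transactions containing bread also contain butter.
p = rule_probability(transactions, {"bread"}, {"butter"})
print(p)
```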
Apriori algorithm
• Method to extract strong rules in a set of transactions
• Principle: If an item set is frequent, then all its subsets are also
frequent
• Item set {a,b,c} -> list its subsets -> how many?
• {a}, {b}, {c}, {a,b}, {b,c}, {c,a} (6 non-empty proper subsets)
• The principle is applied by cardinality (the number of items in an item set)
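The subset count for {a,b,c} can be checked by enumeration. This is a minimal sketch using the standard library; the helper name `proper_nonempty_subsets` is an assumption made for illustration. It leaves out the empty set and the full item set itself, which matches the six subsets listed above.

```python
from itertools import combinations


def proper_nonempty_subsets(items):
    """Return every subset with at least 1 and fewer than len(items) elements."""
    items = sorted(items)
    return [
        set(combo)
        for size in range(1, len(items))
        for combo in combinations(items, size)
    ]


subsets = proper_nonempty_subsets({"a", "b", "c"})
print(len(subsets))  # 6, i.e. 2**3 - 2
```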
Types of association rules

1. Symmetric binary attributes (Male/Female)
2. Continuous attributes
3. Multi-dimensional
4. Sequential
5. Multi-level
Two stage algorithm
• First stage – Generation of frequent item sets
• Scan the data set to compute the relative frequency (support) of each item
• Frequency < s(min) – discard
• Generate frequent 1-item sets, then 2-item sets, and so on up to k-item sets (k = cardinality)
• Support < s(min) – discard
• Repeat iteratively until no further frequent item sets are found
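The first stage can be sketched as a level-wise search. This is an illustrative sketch, not the slides' own implementation: the function name `frequent_item_sets`, the way candidates are joined, and the toy transactions are all assumptions. Item sets with support below `s_min` are discarded, and a size-k candidate is kept only if all of its (k-1)-item subsets are frequent.

```python
from itertools import combinations


def frequent_item_sets(transactions, s_min):
    """Level-wise search: frequent 1-item sets, then 2-item sets, and so on."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(item_set):
        # Relative frequency: share of transactions containing the item set.
        return sum(item_set <= t for t in transactions) / n

    # Level 1: single items that meet the threshold.
    items = sorted({item for t in transactions for item in t})
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= s_min:
            current[frozenset([item])] = s

    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        # Candidates of size k are unions of two frequent (k-1)-item sets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= s_min:
                current[c] = s
    return frequent


# Hypothetical toy transactions, invented for illustration only.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "d"},
    {"a", "b", "c"},
]
result = frequent_item_sets(transactions, s_min=0.3)
for item_set, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(item_set), s)
```

With this toy data, {d} appears in 1 of 5 transactions and is discarded at the first level, so no larger candidate containing d is ever counted.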
Problem: Set s(min) = 0.2. How many frequent item sets are found?
• Sample: Iteration 1: relative frequency of each 1-item set
Item set    Relative frequency    Status
{a}         7/10 = 0.7            Frequent
{b}         0.7                   Frequent
{c}         0.5                   Frequent
{d}         0.3                   Frequent
{e}         0.3                   Frequent

• Repeat for 2-item sets and 3-item sets. Do you get any 4-item set?
Cluster Analysis
• Classification is at a disadvantage on large databases, where labelled data is costly to obtain
• Grouping of data into clusters – objects within a cluster are highly
similar to one another and dissimilar to objects in other clusters
• Partitioning of data into groups of similar objects, then assigning
labels to the groups
• Application –
• market research, pattern recognition (outlier), image
processing, classification of documents in web,
detection of credit card fraud, insurance policy holders
with high average claim, land use in earth observation,
monitoring criminal activities in electronic commerce
• Unsupervised learning – learns by observation
• Requirements of clustering –
• Scalability
• Ability to deal with different types of attributes
• Ability to find clusters of arbitrary shape
• Ability to handle multi-dimensional data
• Ability to handle noisy data
• Interpretable and usable
Major clustering methods
Partitioning method – used in small to medium databases
• Database of n objects & k partitions
• Each partition represents a cluster, k <= n
• A) each group must contain at least one object
• B) each object must belong to exactly one group
• Iterative relocation technique
• Good partitioning technique – objects in the same cluster are close
together & objects in different clusters are far apart.
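Conditions A and B can be checked mechanically for any proposed grouping. This is a minimal sketch, assuming each group is given as a Python set; the name `is_valid_partition` is hypothetical.

```python
def is_valid_partition(objects, groups):
    """Check conditions A and B for a list of groups over the given objects."""
    # A) each group must contain at least one object
    if any(len(group) == 0 for group in groups):
        return False
    # B) each object must belong to exactly one group
    return all(sum(obj in group for group in groups) == 1 for obj in objects)


objects = [1, 2, 3, 4]
print(is_valid_partition(objects, [{1, 2}, {3, 4}]))     # True
print(is_valid_partition(objects, [{1, 2}, {2, 3, 4}]))  # False: 2 is in two groups
```

When both conditions hold, k <= n follows automatically, since k non-empty groups that share no objects need at least k objects.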
k-means partitioning algorithm – non-categorical attributes
• Input: k – number of clusters & D – data set containing n objects
• Output: Set of k clusters
• Steps:
• Arbitrarily choose k objects from D as the initial cluster centers (means)
• Repeat
• Reassign each object to the cluster whose mean it is most similar
(closest) to
• Update the cluster means
• Until no object changes cluster
• Sensitive to outliers
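The steps above can be sketched in plain Python for numeric data. Several choices here are assumptions made for illustration and are not specified in the slides: Euclidean distance as the similarity measure, the first k points as the arbitrary initial centers, the `max_iter` safety cap, and the toy points.

```python
import math


def k_means(points, k, max_iter=100):
    """Plain k-means on numeric tuples; returns (centers, labels)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must satisfy 1 <= k <= n")
    # Assumption: the first k points serve as the arbitrary initial centers.
    centers = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Reassign each object to the cluster with the nearest mean.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no object changed cluster
        labels = new_labels
        # Update each cluster mean; an empty cluster keeps its old center.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels


# Hypothetical toy data: two visibly separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, labels = k_means(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Because each center is a mean, a single extreme point pulls its cluster center toward itself, which is one way to see the sensitivity to outliers.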
