
Please read this disclaimer before proceeding:

This document is confidential and intended solely for the educational purpose of
RMK Group of Educational Institutions. If you have received this document
through email in error, please notify the system manager. This document
contains proprietary information and is intended only for the respective group /
learning community. If you are not the addressee, you should not
disseminate, distribute, or copy it through e-mail. Please notify the sender
immediately by e-mail if you have received this document by mistake and delete
this document from your system. If you are not the intended recipient you are
notified that disclosing, copying, distributing or taking any action in reliance on
the contents of this information is strictly prohibited.
CS8091
Big Data Analytics

Department: IT

Batch/Year: 2018-22/ III


Created by: Ms. S. Jhansi Ida,
Assistant Professor, RMKEC

Date: 01.04.2021
Table of Contents

1. Contents
2. Course Objectives
3. Pre Requisites (Course Names with Code)
4. Syllabus (With Subject Code, Name, LTPC details)
5. Course Outcomes
6. CO-PO/PSO Mapping
7. Lecture Plan
8. Activity Based Learning
9. Unit 3 – Association and Recommendation System
   3.1 Association Rules – Overview
   3.2 Apriori Algorithm
   3.3 Evaluation of Candidate Rules
   3.4 Applications of Association Rules
   3.5 Finding Association and Finding Similarity
   3.6 Recommendation System
   3.7 Collaborative Recommendation
   3.8 Content Based Recommendation
   3.9 Knowledge Based Recommendation
   3.10 Hybrid Recommendation Approaches
10. Assignments
11. Part A (Questions & Answers)
12. Part B Questions
13. Supportive Online Certification Courses
14. Real time Applications
15. Content Beyond the Syllabus
16. Assessment Schedule
17. Prescribed Text Books & Reference Books
18. Mini project Suggestions

Course Objectives
To know the fundamental concepts of big data and analytics.
To explore tools and practices for working with big data.
To learn about stream computing.
To know about the research that requires the integration of large amounts of
data.
Pre Requisites

CS8391 – Data Structures

CS8492 – Database Management System


Syllabus

CS8091 BIG DATA ANALYTICS    L T P C : 3 0 0 3
UNIT I INTRODUCTION TO BIG DATA 9

Evolution of Big data - Best Practices for Big data Analytics - Big data
characteristics - Validating - The Promotion of the Value of Big Data - Big
Data Use Cases - Characteristics of Big Data Applications - Perception and
Quantification of Value - Understanding Big Data Storage - A General
Overview of High-Performance Architecture - HDFS - MapReduce and
YARN - MapReduce Programming Model.

UNIT II CLUSTERING AND CLASSIFICATION 9

Advanced Analytical Theory and Methods: Overview of Clustering -


K-means - Use Cases - Overview of the Method - Determining the Number
of Clusters – Diagnostics – Reasons to Choose and Cautions - Classification:
Decision Trees - Overview of a Decision Tree - The General Algorithm -
Decision Tree Algorithms - Evaluating a Decision Tree - Decision Trees in R -
Naïve Bayes - Bayes’ Theorem - Naïve Bayes Classifier.

UNIT III ASSOCIATION AND RECOMMENDATION SYSTEM 9

Advanced Analytical Theory and Methods: Association Rules - Overview -


Apriori Algorithm – Evaluation of Candidate Rules - Applications of
Association Rules - Finding Association and finding similarity -
Recommendation System: Collaborative Recommendation- Content Based
Recommendation - Knowledge Based Recommendation- Hybrid
Recommendation Approaches.
Syllabus
UNIT IV STREAM MEMORY 9
Introduction to Streams Concepts – Stream Data Model and Architecture
- Stream Computing, Sampling Data in a Stream – Filtering Streams –
Counting Distinct Elements in a Stream – Estimating moments –
Counting oneness in a Window – Decaying Window – Real time Analytics
Platform(RTAP) applications - Case Studies - Real Time Sentiment
Analysis, Stock Market Predictions - Using Graph Analytics for Big Data:
Graph Analytics

UNIT V NOSQL DATA MANAGEMENT FOR BIG DATA AND VISUALIZATION 9

NoSQL Databases: Schema-less Models: Increasing Flexibility for Data
Manipulation - Key Value Stores - Document Stores - Tabular Stores -
Object Data Stores - Graph Databases - Hive - Sharding - Hbase -
Analyzing big data with Twitter - Big data for E-Commerce - Big data for
blogs - Review of Basic Data Analytic Methods using R.
Course Outcomes

CO#   Course Outcome                                                         K Level
CO1   Identify big data use cases and characteristics, and make use of       K3
      HDFS and the MapReduce programming model for data analytics
CO2   Examine the data with clustering and classification techniques         K4
CO3   Discover the similarity of huge volumes of data with association       K4
      rule mining and examine recommender systems
CO4   Perform analytics on data streams                                      K4
CO5   Inspect NoSQL databases and their management                           K4
CO6   Examine the given data with R programming                              K4

CO-PO/PSO Mapping

CO#   PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12  PSO1 PSO2 PSO3
CO1    2   3   3   3   3   1   1   -   1   2    1    1     2    2    2
CO2    2   3   2   3   3   1   1   -   1   2    1    1     2    2    2
CO3    2   3   2   3   3   1   1   -   1   2    1    1     2    2    2
CO4    2   3   2   3   3   1   1   -   1   2    1    1     2    2    2
CO5    2   3   2   3   3   1   1   -   1   2    1    1     1    1    1
CO6    2   3   2   3   3   1   1   -   1   2    1    1     1    1    1
Lecture Plan
UNIT – III
(Each entry lists the topic, number of periods, pertaining CO, taxonomy level, and mode of delivery; proposed and actual dates are to be filled in.)

1. Association Rules – Overview, Apriori Algorithm – 1 period – CO3 – K4 – PPT
2. Evaluation of candidate rules – Applications of Association rules – 1 period – CO3 – K4 – PPT
3. Finding Association and finding similarity – 1 period – CO3 – K4 – PPT / Problem
4. Problems on Apriori Algorithm – 1 period – CO3 – K4 – PPT / Problem
5. Collaborative Recommendation – 1 period – CO3 – K4 – PPT
6. Content Based Recommendation – 1 period – CO3 – K4 – PPT
7. Knowledge Based Recommendation – 1 period – CO3 – K4 – PPT
8. Hybrid Recommendation Approaches – 1 period – CO3 – K4 – PPT
9. Content beyond the syllabus & Revision – 1 period – CO3 – K4 – PPT
ACTIVITY BASED LEARNING

Match the Picture with its relevance

Recommendation system,
Market Basket Analysis,
User Rating, Item Rating
Lecture Notes
3.1 Association Rules

Association rule mining is a descriptive, not predictive, method often used to
discover interesting relationships hidden in a large dataset. The discovered
relationships can be represented as rules or frequent itemsets.
Association rules are commonly used for mining transactions in
databases.
Association rules are capable of answering questions like:
1. Which products tend to be purchased together?
2. Of those customers who are similar to this person, what products do
they tend to buy?
3. Of those customers who have purchased this product, what other
similar products do they tend to view or purchase?

Given a large collection of transactions (depicted as three stacks of


receipts as shown in below Figure 3.1), in which each transaction consists
of one or more items, association rules go through the items being
purchased to see what items are frequently bought together and to
discover a list of rules that describe the purchasing behavior.
The goal with association rules is to discover interesting relationships
among the items. (The relationship occurs too frequently to be random
and is meaningful from a business perspective, which may or may not be
obvious.) The relationships that are interesting depend both on the
business context and the nature of the algorithm being used for the
discovery.
Each of the uncovered rules is in the form X → Y, meaning that when
item X is observed, item Y is also observed. In this case, the left-hand
side (LHS) of the rule is X, and the right-hand side (RHS) of the rule is Y.

Using association rules, patterns can be discovered from the data that
allow the association rule algorithms to disclose rules of related product
purchases. The uncovered rules are listed on the right side.

Fig.3.1 Three stacks of Receipts


The first three rules suggest that when cereal is purchased, 90% of
the time milk is purchased also. When bread is purchased, 40% of
the time milk is purchased also. When milk is purchased, 23% of the
time cereal is also purchased.
In the example of a retail store, association rules are used over
transactions that consist of one or more items. In fact, because of their
popularity in mining customer transactions, association rules are
sometimes referred to as market basket analysis.

Each transaction can be viewed as the shopping basket of a customer


that contains one or more items. This is also known as an itemset.

The term itemset refers to a collection of items or individual entities that


contain some kind of relationship. This could be a set of retail items
purchased together in one transaction, a set of hyperlinks clicked on by
one user in a single session, or a set of tasks done in one day. An itemset
containing k items is called a k-itemset. The notation uses curly braces
like {item 1, item 2, …, item k} to denote a k-itemset. Computation of the
association rules is typically based on itemsets.

Apriori is one of the earliest and the most fundamental algorithms for
generating association rules. It pioneered the use of support for pruning
the itemsets and controlling the exponential growth of candidate
itemsets.

Shorter candidate itemsets, which are known to be frequent itemsets, are


combined and pruned to generate longer frequent itemsets. This
approach eliminates the need for all possible itemsets to be enumerated
within the algorithm, since the number of all possible itemsets can
become exponentially large.
One major component of Apriori is support. Given an itemset L, the
support of L is the percentage of transactions that contain L.

For example, if 80% of all transactions contain itemset {bread}, then the
support of {bread} is 0.8. Similarly, if 60% of all transactions contain
itemset {bread,butter}, then the support of {bread,butter} is 0.6.

A frequent itemset has items that appear together often enough. The
term “often enough” is formally defined with a minimum support
criterion. If the minimum support is set at 0.5, any itemset can be
considered a frequent itemset if at least 50% of the transactions contain
this itemset. In other words, the support of a frequent itemset should be
greater than or equal to the minimum support. For the previous example,
both {bread} and {bread,butter} are considered frequent itemsets at the
minimum support 0.5. If the minimum support is 0.7, only {bread} is
considered a frequent itemset.

If an itemset is considered frequent, then any subset of the frequent


itemset must also be frequent. This is referred to as the Apriori property
(or downward closure property).

For example, if 60% of the transactions contain {bread,jam}, then at


least 60% of all the transactions will contain {bread} or {jam}. In other
words, when the support of {bread,jam} is 0.6, the support of {bread} or
{jam} is at least 0.6.
The following diagram Fig.3.2 illustrates how the Apriori property works.
If itemset {B,C,D} is frequent, then all the subsets of this itemset,
shaded, must also be frequent itemsets. The Apriori property provides
the basis for the Apriori algorithm.

Fig.3.2 Frequent Itemsets using Apriori property

3.2 Apriori Algorithm

The Apriori algorithm takes a bottom-up iterative approach to uncover


the frequent itemsets by first determining all the possible items (or 1-
itemsets, for example {bread}, {eggs}, {milk}, …) and then identifying
which among them are frequent.
Assuming the minimum support threshold (or the minimum support
criterion) is set at 0.5, the algorithm identifies and retains those itemsets
that appear in at least 50% of all transactions and discards (or “prunes
away”) the itemsets that have a support less than 0.5 or appear in fewer
than 50% of the transactions. The word prune is used like it would be in
gardening, where unwanted branches of a bush are clipped away.

In the next iteration of the Apriori algorithm, the identified frequent


1-itemsets are paired into 2-itemsets (for example, {bread,eggs},
{bread,milk}, {eggs,milk}, …) and again evaluated to identify the
frequent 2-itemsets among them.

At each iteration, the algorithm checks whether the support criterion can
be met; if it can, the algorithm grows the itemset, repeating the process
until it runs out of support or until the itemsets reach a predefined
length.

Let variable Ck be the set of candidate k-itemsets and variable Lk be the set
of k-itemsets that satisfy the minimum support.

Given a transaction database D, a minimum support threshold δ, and an


optional parameter N indicating the maximum length an itemset could
reach, Apriori iteratively computes frequent itemsets Lk+1 based on Lk.
The first step of the Apriori algorithm is to identify the frequent itemsets by
starting with each item in the transactions that meets the predefined minimum
support threshold ‘δ’. These itemsets are 1-itemsets denoted as L1, as each
1-itemset contains only one item. Next, the algorithm grows the itemsets by
joining L1 onto itself to form new, grown 2-itemsets denoted as L2 and
determines the support of each 2-itemset in L2.

Those itemsets that do not meet the minimum support threshold ‘δ’ are pruned
away. The growing and pruning process is repeated until no itemsets meet the
minimum support threshold. Optionally, a threshold ‘N’ can be set up to specify
the maximum number of items the itemset can reach or the maximum number
of iterations of the algorithm. Once completed, output of the Apriori algorithm is
the collection of all the frequent k-itemsets.
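
The grow-and-prune loop described above can be illustrated with a short Python sketch. This is a minimal illustration under assumptions (the toy basket list, the 0.5 minimum support, and all function names are invented for this example), not a reference implementation from the text.

```python
from itertools import combinations

def apriori(transactions, min_support=0.5):
    """Return all frequent itemsets (as frozensets) with their support."""
    n = len(transactions)
    transactions = [set(t) for t in transactions]

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    # L1: frequent 1-itemsets.
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]): support(frozenset([i])) for i in items
                if support(frozenset([i])) >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: keep candidates whose (k-1)-subsets are all frequent,
        # then check the support threshold.
        frequent = {}
        for c in candidates:
            if all(frozenset(s) in result for s in combinations(c, k - 1)):
                s = support(c)
                if s >= min_support:
                    frequent[c] = s
        result.update(frequent)
        k += 1
    return result

# Example usage with an assumed toy transaction database.
baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
for itemset, sup in sorted(apriori(baskets).items(), key=lambda x: -x[1]):
    print(set(itemset), round(sup, 2))
```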

Next, a collection of candidate rules is formed based on the frequent itemsets


uncovered in the iterative process described earlier. For example, a frequent
itemset {milk,eggs} may suggest candidate rules {milk}→{eggs} and
{eggs}→{milk}.
3.3 Evaluation of Candidate Rules

Frequent itemsets can form candidate rules such as X implies Y (X → Y).


Measures such as confidence, lift, and leverage can help evaluate the
appropriateness of these candidate rules.

Confidence is defined as the measure of certainty or trustworthiness
associated with each discovered rule. Confidence is the percentage of
transactions that contain both X and Y out of all the transactions that contain
X. It is defined by the following equation:

Confidence(X → Y) = Support(X ∪ Y) / Support(X)

For example, if {bread,eggs,milk} has a support of 0.15 and {bread,eggs} also has a
support of 0.15, the confidence of rule {bread,eggs}→{milk} is 1, which means 100%
of the time a customer buys bread and eggs, milk is bought as well. The rule is
therefore correct for 100% of the transactions containing bread and eggs.

A relationship is interesting when the algorithm identifies the relationship with a


measure of confidence greater than or equal to a predefined threshold. This predefined
threshold is called the minimum confidence. A higher confidence indicates that the rule
(X → Y) is more interesting or more trustworthy, based on the sample dataset.

Even though confidence can identify the interesting rules from all the candidate rules, it
comes with a problem. Given rules in the form of X → Y, confidence considers only the
antecedent (X) and the co-occurrence of X and Y; it does not take the consequent of
the rule (Y) into account. Therefore, confidence cannot tell whether a rule contains a true
implication of the relationship or whether the rule is purely coincidental. X and Y can be
statistically independent yet still receive a high confidence score.
Lift measures how many times more often X and Y occur together
than expected if they were statistically independent of each other. Lift is
a measure of how X and Y are really related rather than coincidentally
happening together. It is defined by the following equation:

Lift(X → Y) = Support(X ∪ Y) / (Support(X) × Support(Y))

Lift is 1 if X and Y are statistically independent of each other. In


contrast, a lift of X → Y greater than 1 indicates that there is some
usefulness to the rule. A larger value of lift suggests a greater
strength of the association between X and Y.

Assuming 1,000 transactions, with {milk,eggs} appearing in 300 of
them, {milk} appearing in 500, and {eggs} appearing in 400, then

Lift(milk → eggs) = 0.3 / (0.5 × 0.4) = 1.5

If {bread} appears in 400 transactions and {milk,bread} appears in
400, then

Lift(milk → bread) = 0.4 / (0.5 × 0.4) = 2.0

Therefore it can be concluded that milk and bread have a stronger


association than milk and eggs.
Leverage uses a difference instead of a ratio. Leverage measures
the difference in the probability of X and Y appearing together in the
dataset compared to what would be expected if X and Y were statistically
independent of each other. It is defined by:

Leverage(X → Y) = Support(X ∪ Y) − Support(X) × Support(Y)

Leverage is 0 when X and Y are statistically independent of each other. If


X and Y have some kind of relationship, the leverage would be greater
than zero. A larger leverage value indicates a stronger relationship
between X and Y.
For the previous example,

Leverage(milk → eggs) = 0.3 − 0.5 × 0.4 = 0.1

and

Leverage(milk → bread) = 0.4 − 0.5 × 0.4 = 0.2

This confirms that milk and bread have a stronger association than milk and
eggs.

Confidence is able to identify trustworthy rules, but it cannot tell whether


a rule is coincidental. A high-confidence rule can sometimes be misleading
because confidence does not consider support of the itemset in the rule
consequent. Measures such as lift and leverage not only ensure interesting
rules are identified but also filter out the coincidental rules.
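
Support, confidence, lift, and leverage can all be computed directly from itemset supports. The sketch below uses an assumed ten-transaction toy dataset scaled to reproduce the milk/eggs numbers used above; the function name and the data are illustrative only.

```python
def rule_metrics(transactions, lhs, rhs):
    """Compute support, confidence, lift and leverage of the rule lhs -> rhs."""
    n = len(transactions)
    lhs, rhs = set(lhs), set(rhs)
    sup = lambda s: sum(1 for t in transactions if s <= set(t)) / n

    support_both = sup(lhs | rhs)
    support_lhs, support_rhs = sup(lhs), sup(rhs)
    confidence = support_both / support_lhs
    lift = support_both / (support_lhs * support_rhs)
    leverage = support_both - support_lhs * support_rhs
    return support_both, confidence, lift, leverage

# Assumed toy data chosen so that Support(milk)=0.5, Support(eggs)=0.4 and
# Support(milk,eggs)=0.3, mirroring the 1,000-transaction example on a small scale.
transactions = (
    [{"milk", "eggs"}] * 3 + [{"milk"}] * 2 + [{"eggs"}] * 1 + [{"butter"}] * 4
)
print(rule_metrics(transactions, {"milk"}, {"eggs"}))
# -> support 0.3, confidence 0.6, lift 1.5, leverage 0.1
```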
3.4 Applications of Association Rules

The term market basket analysis refers to a specific implementation of


association rules mining that many companies use for a variety of purposes,
including these:
Broad-scale approaches to better merchandising—what products should be
included in or excluded from the inventory each month.
Cross-merchandising between products and high-margin or high-ticket items -
Physical or logical placement of product within related categories of products .
Promotional programs—multiple product purchase incentives managed through a
loyalty card program.

Besides market basket analysis, association rules are commonly used for
recommender systems and clickstream analysis.
Many online service providers such as Amazon and Netflix use recommender
systems. Recommender systems can use association rules to discover related
products or identify customers who have similar interests. For example,
association rules may suggest that those customers who have bought product A
have also bought product B, or those customers who have bought products A, B,
and C are more similar to this customer. These findings provide opportunities for
retailers to cross-sell their products.

Clickstream analysis refers to the analytics on data related to web browsing and
user clicks, which is stored on the client or the server side. Web usage log files
generated on web servers contain huge amounts of information, and association
rules can potentially give useful knowledge to web usage data analysts. For
example, association rules may suggest that website visitors who land on page X
click on links A, B, and C much more often than links D, E, and F. This observation
provides valuable insight on how to better personalize and recommend the
content to site visitors.
Problem on Apriori Algorithm:

Discuss the apriori algorithm for discovering frequent itemsets. Apply apriori
algorithm to the following data set. Use 0.3 for the minimum support
value. The set of item is {strawberry, litchi, oranges, butterfruit, vanilla,
Banana, apple}.

The minimum support is 0.3.


Step-1: K=1
(i) Create a table of 1-itemset containing support count of each item
present in dataset called C1 (candidate set).

(ii) Compare each candidate itemset’s support with the minimum support
of 0.3. If the support of a candidate itemset is less than the minimum
support, remove that itemset. This gives the frequent itemset L1.
Step-2: K=2
Generate candidate set C2 using L1 (called join step). Create a table of 2-
itemset.
Check all subsets of an itemset are frequent or not and if not frequent
remove that itemset.
Now find support count of these itemsets by searching in dataset.

(ii) Compare the support of each candidate in C2 with the minimum support of 0.3.
If the support of a candidate itemset is less than the minimum support,
remove that itemset. This gives the frequent itemset L2.
Step-3: K=3
Generate candidate set C3 using L2 (join step). Create a table of 3-itemsets.
Check whether all subsets of these itemsets are frequent or not and, if not,
remove that itemset.
Find the support of the remaining itemsets by searching the dataset.

Generate Association Rules:

Frequent Itemset={Strawberry, Litchi, Orange}


The Non-empty subsets are ({Strawberry}, {Litchi}, {Orange}, {Strawberry,
Litchi}, {Strawberry, Orange}, {Litchi, Orange})

Resultant Association rules are:

{Strawberry} → {Litchi, Orange}


{Litchi} →{Strawberry, Orange}
{Orange} →{Strawberry, Litchi}
{Strawberry, Litchi}→{Orange}
{Strawberry, Orange}→ {Litchi}
{Litchi, Orange}→ {Strawberry}

Find the confidence for the above Association rules.


Confidence({Strawberry} → {Litchi, Orange}) = Support(Strawberry, Litchi, Orange) / Support(Strawberry) = 3/5 = 60%
Confidence({Litchi} → {Strawberry, Orange}) = Support(Strawberry, Litchi, Orange) / Support(Litchi) = 3/4 = 75%
Confidence({Orange} → {Strawberry, Litchi}) = Support(Strawberry, Litchi, Orange) / Support(Orange) = 3/4 = 75%
Confidence({Strawberry, Litchi} → {Orange}) = Support(Strawberry, Litchi, Orange) / Support(Strawberry, Litchi) = 3/4 = 75%
Confidence({Strawberry, Orange} → {Litchi}) = Support(Strawberry, Litchi, Orange) / Support(Strawberry, Orange) = 3/3 = 100%
Confidence({Litchi, Orange} → {Strawberry}) = Support(Strawberry, Litchi, Orange) / Support(Litchi, Orange) = 3/3 = 100%

Since the confidence is 100% for the last two rules, they are considered strong
association rules.

Association Rule Mining & Apriori Algorithm


https://www.youtube.com/watch?v=guVvtZ7ZClw&list=TLPQMTkwNDIwMjEKcv
GrD1k5qQ&index=1
3.5 Finding Association and Finding Similarity

There are various measures used to find associations and similarity between
different entities. Finding associations involves finding the frequent itemsets
that satisfy association rules, using methods such as Apriori or FP-Growth.

Association rules are used for finding those frequent itemsets whose support
is greater than or equal to a support threshold. If we are looking for association
rules X → Y that apply to a reasonable fraction of the baskets in market-basket
analysis, then the support of X must be reasonably high.

We also want the confidence of the rule to be reasonably high; otherwise the
rule has little practical effect. So, to find associations between entities, we
have to find all the itemsets that meet the support threshold and have a high
confidence value. During the extraction of itemsets, it is assumed that
there are not too many frequent itemsets and thus not too many candidates
for high-support, high-confidence association rules.

This assumption has important consequences for the efficiency of the
algorithms: having too many candidates for high-support, high-confidence
association rules must be avoided.

The similarity between entities can be measured by finding the closest


distance between two data points. The similarity can be found using various
distance measures. A distance measure is the measure of how similar one
observation is compared to a set of other observations.
The various distance measures are
❖Euclidean distance
❖Jaccard distance
❖Cosine distance
❖Edit distance and
❖Hamming distance

Definition of a Distance Measure

Suppose we have a set of points, called a space. A distance measure on this


space is a function d(x, y) that takes two points in the space as arguments
and produces a real number, and satisfies the following axioms:
1. d(x, y) ≥ 0 (no negative distances).
2. d(x, y) = 0 if and only if x = y (distances are positive, except for the
distance from a point to itself).
3. d(x, y) = d(y, x) (distance is symmetric).
4. d(x, y) ≤ d(x, z) + d(z, y) (the triangle inequality).
The triangle-inequality axiom is what makes all distance measures behave as
if distance describes the length of a shortest path from one point to another.

Euclidean Distances:

An n-dimensional Euclidean space is one where points are vectors of n real
numbers. The distance measure in this space is referred to as the L2-norm
and is defined by:

d([x1, x2, ..., xn], [y1, y2, ..., yn]) = sqrt((x1 − y1)² + (x2 − y2)² + ... + (xn − yn)²)

That is, square the distance in each dimension, sum the squares, and take
the positive square root.
It is easy to verify the first three requirements for a distance measure. The
Euclidean distance between two points cannot be negative, because the
positive square root is intended. Since all squares of real numbers are
nonnegative, any i such that xi ≠ yi forces the distance to be strictly positive.
On the other hand, if xi = yi for all i, then the distance is clearly 0.
Symmetry follows because (xi − yi)² = (yi − xi)².
The triangle inequality requires a good deal of algebra to verify; intuitively,
it states that the sum of the lengths of any two sides of a triangle is no less
than the length of the third side.

Manhattan Distance

There are other distance measures that have been used for Euclidean
spaces. For any constant r, we can define the Lr-norm to be the distance
measure d defined by:

d([x1, ..., xn], [y1, ..., yn]) = ( |x1 − y1|^r + |x2 − y2|^r + ... + |xn − yn|^r )^(1/r)

The case r = 2 is the usual L2-norm. Another common distance measure is
the L1-norm, or Manhattan distance. Manhattan distance is the distance
between two points defined as the sum of the magnitudes of the
differences in each dimension.

Another interesting distance measure is the L∞-norm, which is the limit as r


approaches infinity of the Lr-norm. As r gets larger, only the dimension with
the largest difference matters, so the L∞-norm is defined as the maximum
of |xi − yi | over all dimensions i.
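
As an illustration of the Lr-norm family defined above, the following minimal Python sketch computes the Euclidean (r = 2), Manhattan (r = 1), and L∞ distances; the example vectors and the function name are assumptions.

```python
def lr_distance(x, y, r=2):
    """Lr-norm distance between two equal-length numeric vectors."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if r == float("inf"):
        return max(diffs)          # only the largest per-dimension difference matters
    return sum(d ** r for d in diffs) ** (1 / r)

p, q = (1, 2, 3), (4, 6, 3)
print(lr_distance(p, q, 2))             # Euclidean: sqrt(9 + 16 + 0) = 5.0
print(lr_distance(p, q, 1))             # Manhattan: 3 + 4 + 0 = 7
print(lr_distance(p, q, float("inf")))  # L-infinity: 4
```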
Jaccard Distance

Jaccard distance of sets is defined by d(x, y) = 1 − SIM(x, y), where
SIM(x, y) = |x ∩ y| / |x ∪ y| is the Jaccard similarity. In other words, the Jaccard
distance is 1 minus the ratio of the sizes of the intersection and union of sets
x and y. To verify that this function is a distance measure,
1. d(x,y) is nonnegative because the size of the intersection cannot
exceed the size of the union.
2. d(x, y) = 0 if x = y, because x ∪ x = x ∩ x = x. However, if x ≠ y,
then the size of x ∩ y is strictly less than the size of x ∪ y, so d(x, y) is
strictly positive.
3. d(x, y) = d(y, x) because both union and intersection are symmetric; i.e.,
x ∪ y = y ∪ x and x ∩ y = y ∩ x.
4. For the triangle inequality, SIM(x, y) is the probability a random minhash
function maps x and y to the same value. Thus, the Jaccard distance d(x, y) is
the probability that a random minhash function does not send x and y to the
same value. Therefore translate the condition d(x, y) ≤ d(x, z) + d(z, y) to
the statement that if h is a random minhash function, then the probability
that h(x) ≠ h(y) is no greater than the sum of the probability that h(x) ≠ h(z)
and the probability that h(z) ≠ h(y). However, this statement is true because
whenever h(x) ≠ h(y), at least one of h(x) and h(y) must be different from
h(z). They could not both be h(z), because then h(x) and h(y) would be the
same.
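
A minimal sketch of the Jaccard distance on Python sets, following the definition above; the two example sets are assumptions.

```python
def jaccard_distance(x, y):
    """1 minus the ratio of intersection size to union size of two sets."""
    x, y = set(x), set(y)
    if not x and not y:
        return 0.0                 # two empty sets are treated as identical
    return 1 - len(x & y) / len(x | y)

a = {"bread", "milk", "eggs"}
b = {"bread", "butter"}
print(jaccard_distance(a, b))      # 1 - 1/4 = 0.75
```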
Cosine Distance

The cosine distance makes sense in spaces that have dimensions, including
Euclidean spaces and discrete versions of Euclidean spaces, such as spaces
where points are vectors with integer components or Boolean (0 or 1)
components. In such a space, points may be thought of as directions.

The cosine distance between two points is the angle that the vectors to
those points make. This angle will be in the range 0 to 180 degrees,
regardless of how many dimensions the space has. The cosine distance is
calculated by first computing the cosine of the angle, and then applying the
arc-cosine function to translate to an angle in the 0-180 degree range.

Given two vectors x and y, the cosine of the angle between them is the dot
product x.y divided by the L2-norms of x and y (i.e., their Euclidean
distances from the origin).

Example: Let the two vectors be x = [1, 2, −1] and y = [2, 1, 1]. The dot
product x·y is 1 × 2 + 2 × 1 + (−1) × 1 = 3. The L2-norm of both vectors is
√6; for example, the L2-norm of x is √(1² + 2² + (−1)²) = √6. Thus, the cosine of
the angle between x and y is 3/(√6 × √6), or 1/2. The angle whose cosine is
1/2 is 60 degrees, so that is the cosine distance between x and y.
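
The following sketch computes the cosine distance exactly as described: the cosine of the angle from the dot product and L2-norms, then the arc-cosine. It reproduces the x = [1, 2, −1], y = [2, 1, 1] example; the function name is illustrative.

```python
import math

def cosine_distance(x, y):
    """Angle (in degrees) between two vectors, computed from their cosine."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    cosine = dot / (norm_x * norm_y)
    return math.degrees(math.acos(cosine))

print(cosine_distance([1, 2, -1], [2, 1, 1]))   # ≈ 60 degrees, as in the example
```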
Edit Distance

Edit distance can be defined by d(x, y) = |x| + |y| − 2|LCS(x, y)|, where |x| and |y|
represent the lengths of the strings and LCS(x, y) is their longest common subsequence.
Example :
The strings x = abcde and y = acfdeg have a unique LCS, which is acde. It is
the longest possible, because it contains every symbol appearing in both x
and y. These common symbols appear in the same order in both strings, so
we are able to use them all in an LCS. The length of x is 5, the length of y is
6, and the length of their LCS is 4. The edit distance is thus 5 + 6 − 2 × 4 =
3.

Consider x = aba and y = bab. Their edit distance is 2. We can convert x to


y by deleting the first a and then inserting b at the end. There are two LCS’s:
ab and ba. Each can be obtained by deleting one symbol from each string.
When a pair of strings has multiple LCS’s, they all have the same length, so the
edit distance is well defined. Here it is computed as 3 + 3 − 2 × 2 = 2.

Edit distance is a distance measure. No edit distance can be negative, and


only two identical strings have an edit distance of 0. For the edit distance to
be symmetric, note that a sequence of insertions and deletions can be
reversed, with each insertion becoming a deletion, and vice versa. For the
triangle inequality, the way to turn a string s into a string t is to turn s into
some string u and then turn u into t. Thus, the number of edits made going
from s to u, plus the number of edits made going from u to t cannot be less
than the smallest number of edits that will turn s into t.
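
A small dynamic-programming sketch of the LCS-based edit distance defined above (insertions and deletions only); it reproduces both worked examples. The helper names are assumptions.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two strings (dynamic programming)."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, cx in enumerate(x, 1):
        for j, cy in enumerate(y, 1):
            table[i][j] = (table[i - 1][j - 1] + 1 if cx == cy
                           else max(table[i - 1][j], table[i][j - 1]))
    return table[len(x)][len(y)]

def edit_distance(x, y):
    """Insert/delete edit distance: |x| + |y| - 2 * |LCS(x, y)|."""
    return len(x) + len(y) - 2 * lcs_length(x, y)

print(edit_distance("abcde", "acfdeg"))   # 5 + 6 - 2*4 = 3
print(edit_distance("aba", "bab"))        # 3 + 3 - 2*2 = 2
```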
Hamming Distance

Given a space of vectors, the Hamming distance between two vectors is


defined to be the number of components in which they differ.
The Hamming distance cannot be negative, and if it is zero, then the vectors
are identical. The distance does not depend on which of two vectors we
consider first.
The triangle inequality should also be evident. If x and z differ in m
components, and z and y differ in n components, then x and y cannot differ
in more than m + n components. Hamming distance is used commonly
when the vectors are Boolean; they consist of 0’s and 1’s only. However, the
vectors can have components from any set.

Example: The Hamming distance between the vectors 10101 and 11110 is 3.
That is, these vectors differ in the second, fourth, and fifth components,
while they agree in the first and third components.
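
A one-function sketch of the Hamming distance as defined above, using the 10101/11110 example; the function name is illustrative.

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length vectors differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance requires vectors of equal length")
    return sum(1 for a, b in zip(x, y) if a != b)

print(hamming_distance("10101", "11110"))   # 3, as in the example above
```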

3.6 Recommendation System

Definition:
The software system that determines which items should be shown to a
particular visitor of a webpage or website is known as a recommender
system.
Every recommender system must develop and maintain a user model or user
profile that contains the user’s preferences. The existence of a user model is
central to every recommender system, but the way in which this information
is acquired and exploited depends on the particular recommendation
technique.
User preferences can be acquired implicitly by monitoring user behavior, or
the recommender system can explicitly ask the visitor about his or her
preferences.
When generating a list of personalized recommendations, the system can
additionally take the behavior, opinions, and tastes of a large community of
other users into account. Such systems are often referred to as
community-based or collaborative approaches.

Types of Recommendation System

1. Collaborative Recommendation
2. Content Based Recommendation
3. Knowledge Based Recommendation
4. Hybrid Recommendation

3.7 Collaborative Recommendation


The basic idea of these systems is that if users shared the same interests in
the past, they will also have similar tastes in the future.
The main idea of collaborative recommendation approaches is to exploit
information about the past behavior or the opinions of an existing user
community for predicting which items the current user of the system will
most probably like or be interested in.
This type of recommendation system is used as a tool in online retail sites to
customize the content to the needs of a particular customer and to thereby
promote additional items and increase sales.
Collaborative Recommendation
https://www.youtube.com/watch?v=wTZUHiLUY94
Typical questions that arise in the context of collaborative approaches include
the following:
1.How do we find users with similar tastes to the user for whom we need a
recommendation?
2.How do we measure similarity?
3.What should we do with new users, for whom a buying history is not yet
available?
4.How do we deal with new items that nobody has bought yet?
5.What if we have only a few ratings that we can exploit?
6.What other techniques besides looking for similar users can we use for
making a prediction about whether a certain user will like an item?

Pure collaborative approaches take a matrix of given user–item ratings as the


only input and produce the following types of output: (a) a (numerical)
prediction indicating to what degree the current user will like or dislike a
certain item and (b) a list of n recommended items. Such a top-N list should
not contain items that the current user has already bought.

3.7.1 User-based nearest neighbor recommendation

Given a ratings database and the ID of the current (active) user as an input,
identify other users (referred to as peer users or nearest neighbors) that had
similar preferences to those of the active user in the past. Then, for every
product p that the active user has not yet seen, a prediction is computed
based on the ratings for p made by the peer users.
The assumptions of this type of methods are that
(a) if users had similar tastes in the past they will have similar tastes in the
future.
(b) user preferences remain stable and consistent over time.
Example:

A database of ratings of the current user, Alice, and some other users is
shown in the table below (ratings are on a 1-to-5 scale; Alice has not yet
rated Item5):

         Item1  Item2  Item3  Item4  Item5
Alice      5      3      4      4      ?
User1      3      1      2      3      3
User2      4      3      4      3      5
User3      3      3      1      5      4
User4      1      5      5      2      1
Alice has rated “Item1” with a “5” on a 1-to-5 scale, which means that she
strongly liked this item. The task of a recommender system in this example
is to determine whether Alice will like or dislike “Item5”, which Alice has not
yet rated or seen.
If we can predict that Alice will like this item very strongly, we should include
it in Alice’s recommendation list. For this, we search for users whose taste is
similar to Alice’s and then take the ratings of this group for “Item5” to
predict whether Alice will like this item.

Let U = {u1,..., un} to denote the set of users, P = {p1,..., pm} for the set of
products (items), and R as an n × m matrix of ratings ri,j with i ∈ 1 ... n, j ∈
1 ... m. The possible rating values are defined on a numerical scale from 1
(strongly dislike) to 5 (strongly like).
If a certain user i has not rated an item j, the corresponding matrix entry ri,j
remains empty.
With respect to the determination of the set of similar users, the measure
used in recommender systems is Pearson’s correlation coefficient.
The similarity sim(a, b) of users a and b, given the rating matrix R, is defined
by

sim(a, b) = Σp (ra,p − r̄a)(rb,p − r̄b) / ( √(Σp (ra,p − r̄a)²) × √(Σp (rb,p − r̄b)²) )

where the sums run over the items p that both users have rated, and the
symbol r̄a corresponds to the average rating of user a over those items.

For example, the similarity between Alice and User1 is

sim(Alice, User1) = 2 / (1.414 × 1.658) = 2 / 2.345 ≈ 0.85
The Pearson correlation coefficient takes values from +1 (strong positive
correlation) to −1 (strong negative correlation).
The similarities to the other users, User2, User3 and User4, are 0.70, 0.00,
and −0.79, respectively. Based on these calculations, we observe that User1
and User2 were somehow similar to Alice in their rating behavior in the past.

To make a prediction for Item5, we now have to decide which of the


neighbors’ ratings we shall take into account and how strongly we shall value
their opinions. In this example, the choice would be to take User1 and User2
as peer users to predict Alice’s rating.

The formula for computing a prediction for the rating of user a for item p,
which also factors in the relative proximity of the nearest neighbors N and
a’s average rating r̄a, is the following:

pred(a, p) = r̄a + Σb∈N sim(a, b) × (rb,p − r̄b) / Σb∈N sim(a, b)

In the example, the prediction for Alice’s rating for Item5 based on the
ratings of the near neighbors User1 and User2 will be

pred(Alice, Item5) = 4 + (0.85 × (3 − 2.4) + 0.70 × (5 − 3.8)) / (0.85 + 0.70)
                   = 4 + (0.51 + 0.84) / 1.55 = 4 + 1.35/1.55 = 4 + 0.87 = 4.87
We can now compute rating predictions for Alice for all items she has not yet
seen and include the ones with the highest prediction values in the
recommendation list. In the example, it will be a good choice to include
Item5 in such a list.
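
The complete user-based procedure – Pearson similarity over co-rated items followed by the weighted-deviation prediction – can be sketched compactly in Python. The ratings dictionary encodes the Alice example table; all function and variable names are assumptions for illustration, not a library API.

```python
import math

ratings = {                      # user -> {item: rating}; Alice's Item5 is unknown
    "Alice": {"Item1": 5, "Item2": 3, "Item3": 4, "Item4": 4},
    "User1": {"Item1": 3, "Item2": 1, "Item3": 2, "Item4": 3, "Item5": 3},
    "User2": {"Item1": 4, "Item2": 3, "Item3": 4, "Item4": 3, "Item5": 5},
    "User3": {"Item1": 3, "Item2": 3, "Item3": 1, "Item4": 5, "Item5": 4},
    "User4": {"Item1": 1, "Item2": 5, "Item3": 5, "Item4": 2, "Item5": 1},
}

def mean(user):
    r = ratings[user]
    return sum(r.values()) / len(r)

def pearson(a, b):
    """Pearson correlation of users a and b over their co-rated items."""
    common = ratings[a].keys() & ratings[b].keys()
    ma = sum(ratings[a][p] for p in common) / len(common)
    mb = sum(ratings[b][p] for p in common) / len(common)
    num = sum((ratings[a][p] - ma) * (ratings[b][p] - mb) for p in common)
    da = math.sqrt(sum((ratings[a][p] - ma) ** 2 for p in common))
    db = math.sqrt(sum((ratings[b][p] - mb) ** 2 for p in common))
    return num / (da * db) if da and db else 0.0

def predict(a, item, k=2):
    """Predict a's rating for item from the k most similar users who rated it."""
    peers = [(pearson(a, b), b) for b in ratings
             if b != a and item in ratings[b]]
    peers = sorted(peers, reverse=True)[:k]
    num = sum(sim * (ratings[b][item] - mean(b)) for sim, b in peers)
    den = sum(sim for sim, _ in peers)
    if not den:
        return mean(a)
    return mean(a) + num / den

print(round(predict("Alice", "Item5"), 2))   # 4.87, as in the worked example
```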
Rating databases are much larger and can comprise thousands or even
millions of users and items, which means that we must think about
computational complexity.
The rating matrix is very sparse, meaning that every user will rate only a
very small subset of the available items. It is unclear what we can
recommend to new users or how we deal with new items for which no
ratings exist.

Better similarity and weighting metrics

Pearson’s correlation coefficient is used to measure the similarity among


users. Other metrics, such as adjusted cosine similarity, Spearman’s rank
correlation coefficient, or the mean squared difference measure were also
used to determine the proximity between users.

For user-based recommender systems, the Pearson coefficient outperforms


other measures of comparing users. For item-based recommendation
techniques, the cosine similarity measure consistently outperforms the
Pearson correlation metric.

Various improvements to the weighting metrics have been developed, such as
applying a transformation function to the item ratings that reduces the
relative importance of agreement on universally liked items; this technique is
called inverse user frequency. A variance weighting factor addresses the same
problem by increasing the influence of items that have a high variance in the
ratings – that is, items on which controversial opinions exist.
The significance weighting scheme applies a linear reduction of the similarity
weight when there are fewer than fifty co-rated items; with it, the increase in
prediction accuracy is significant.
Another method for improving the accuracy of the recommendations by
fine-tuning the prediction weights is termed case amplification. Case
amplification refers to an adjustment of the weights of the neighbors in such a
way that values close to +1 and −1 are emphasized by multiplying the
original weights by a constant factor ρ.

Neighborhood selection

In the example, we decided not to take all neighbors into account


(neighborhood selection). For the calculation of prediction, we included only
those that had a positive correlation with the active user. If we include all
users in the neighborhood, this would not only negatively influence the
performance with respect to the required calculation time, but it would also
have an effect on the accuracy of the recommendation.

The common techniques for reducing the size of the neighborhood are to
define a specific minimum threshold of user similarity or to limit the size to a
fixed number and to take only the k nearest neighbors into account. If the
similarity threshold is too high, the size of the neighborhood will be very
small for many users, which in turn means that for many items no
predictions can be made (Reduced Coverage). In contrast, when the
threshold is too low, the neighborhood sizes are not significantly reduced.
The value chosen for k – the size of the neighborhood – does not influence
coverage. However, the problem of finding a good value for k still exists:
When the number of neighbors k taken into account is too high, too many
neighbors with limited similarity bring additional “noise” into the predictions.
When k is too small – for example, below 10, the quality of the predictions
may be negatively affected. In real-world situations, a neighborhood of 20
to 50 neighbors seems reasonable.

3.7.2 Item-based nearest neighbor recommendation

Although user-based CF approaches have been applied successfully in


different domains, some serious challenges remain when it comes to large
E-commerce sites, on which we must handle millions of users and catalog
items. The need to scan a vast number of potential neighbors makes it
impossible to compute predictions in real time.
Large-scale E-commerce sites, implement a different technique, item-based
recommendation, which is more apt for offline preprocessing and thus allows
for the computation of recommendations in real time even for a very large
rating matrix.

The main idea of item-based algorithms is to compute predictions using the


similarity between items and not the similarity between users.
Example:
Let us examine ratings database again and make a prediction for Alice for
Item5. Compare the rating vectors of the other items and look for items that
have ratings similar to Item5. In the example, the ratings for Item5 (3, 5, 4, 1)
are similar to the ratings of Item1 (3, 4, 3, 1) and there is also a partial
similarity with Item4 (3, 3, 5, 2).

The idea of item-based recommendation is to look at Alice’s ratings for these


similar items. Alice gave a “5” to Item1 and a “4” to Item4. An item-based
algorithm computes a weighted average of these other ratings and will predict
a rating for Item5 somewhere between 4 and 5.

The cosine similarity measure

The metric measures the similarity between two n-dimensional vectors based
on the angle between them. This measure is commonly used in the fields of
information retrieval and text mining to compare two text documents, in which
documents are represented as vectors.

The similarity between two items a and b – viewed as the corresponding
rating vectors 𝑎⃗ and 𝑏⃗ – is defined as follows:

sim(𝑎⃗, 𝑏⃗) = (𝑎⃗ · 𝑏⃗) / (|𝑎⃗| × |𝑏⃗|)

The · symbol is the dot product of vectors. |𝑎⃗| is the Euclidian length of the
vector, which is defined as the square root of the dot product of the vector
with itself.
The cosine similarity of Item5 and Item1 is calculated as follows:

sim(Item5, Item1) = (3 × 3 + 5 × 4 + 4 × 3 + 1 × 1) / (√51 × √35)
                  = 42 / (7.14 × 5.92) = 42 / 42.27 = 0.99
The possible similarity values are between 0 and 1, where values near to 1
indicate a strong similarity.

The basic cosine measure does not take the differences in the average rating
behavior of the users into account. This problem is solved by using the
adjusted cosine measure, which subtracts the user average from the ratings.
The values for the adjusted cosine measure correspondingly range from −1
to +1, as in the Pearson measure.

Let U be the set of users that rated both items a and b. The adjusted cosine
measure is calculated as follows:

sim(a, b) = Σu∈U (ru,a − r̄u)(ru,b − r̄u) / ( √(Σu∈U (ru,a − r̄u)²) × √(Σu∈U (ru,b − r̄u)²) )

The original ratings database is transformed by replacing the original rating
values with their deviation from the user's average rating.

The adjusted cosine similarity value for Item5 and Item1 can then be computed in the same way from these transformed ratings.

After the similarities between the items are determined, we can predict a
rating for Alice for Item5 by calculating a weighted sum of Alice’s ratings for
the items that are similar to Item5. We can predict the rating for user u for a
product p as follows:

pred(u, p) = Σi∈ratedItems(u) sim(i, p) × ru,i / Σi∈ratedItems(u) sim(i, p)
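
A compact sketch of the item-based scheme under the same toy ratings: adjusted cosine similarity between item columns, followed by a weighted sum of the user's own ratings. The names and the choice to keep only positively correlated items are assumptions made for this illustration.

```python
import math

# user -> {item: rating}; toy data from the Alice example (Item5 unknown for Alice)
ratings = {
    "Alice": {"Item1": 5, "Item2": 3, "Item3": 4, "Item4": 4},
    "User1": {"Item1": 3, "Item2": 1, "Item3": 2, "Item4": 3, "Item5": 3},
    "User2": {"Item1": 4, "Item2": 3, "Item3": 4, "Item4": 3, "Item5": 5},
    "User3": {"Item1": 3, "Item2": 3, "Item3": 1, "Item4": 5, "Item5": 4},
    "User4": {"Item1": 1, "Item2": 5, "Item3": 5, "Item4": 2, "Item5": 1},
}
user_mean = {u: sum(r.values()) / len(r) for u, r in ratings.items()}

def adjusted_cosine(a, b):
    """Adjusted cosine similarity of items a and b over users who rated both."""
    users = [u for u, r in ratings.items() if a in r and b in r]
    da = [ratings[u][a] - user_mean[u] for u in users]   # deviations for item a
    db = [ratings[u][b] - user_mean[u] for u in users]   # deviations for item b
    num = sum(x * y for x, y in zip(da, db))
    den = math.sqrt(sum(x * x for x in da)) * math.sqrt(sum(y * y for y in db))
    return num / den if den else 0.0

def predict(user, item):
    """Weighted sum of the user's ratings for items similar to the target item."""
    rated = ratings[user]
    sims = {i: adjusted_cosine(i, item) for i in rated}
    # Keep only positively correlated items to avoid meaningless negative weights.
    sims = {i: s for i, s in sims.items() if s > 0}
    num = sum(s * rated[i] for i, s in sims.items())
    den = sum(sims.values())
    return num / den if den else user_mean[user]

print(round(predict("Alice", "Item5"), 2))   # a value between 4 and 5 (about 4.65)
```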
Preprocessing data for Item-based filtering

The main problem with traditional user-based CF is that the algorithm does
not scale well for such large numbers of users and catalog items. Given M
customers and N catalog items, in the worst case all M records containing
up to N items must be evaluated. The idea is to construct in advance the
item similarity matrix that describes the pairwise similarity of all catalog
items. At run time, a prediction for a product p and user u is made by
determining the items that are most similar to p and by building the weighted
sum of u’s ratings for these items in the neighborhood.

The number of neighbors to be taken into account is limited to the number


of items that the active user has rated. As the number of such items is small,
the computation of the prediction can be easily accomplished within the
short time frame allowed in interactive online applications.

With respect to memory requirements, a full item similarity matrix for N
items can have up to N² entries. Model-based approaches are an option
that exploits only a certain fraction of the rating matrix to reduce the
computational complexity.

Basic techniques include subsampling, which can be accomplished by


randomly choosing a subset of the data or by ignoring customer records that
have only a very small set of ratings or that only contain very popular items.
Implicit and Explicit Ratings

For gathering users’ opinions, explicit item ratings are the most precise
method. In most cases, five-point or seven-point response scales ranging
from “Strongly dislike” to “Strongly like” are used. They are then internally
transformed to numeric values so that the similarity measures can be
applied.

The main problems with explicit ratings are that such ratings require
additional efforts from the users of the recommender system and users
might not be willing to provide such ratings as long as the value cannot be
easily seen. Thus, the number of available ratings could be too small, which
in turn results in poor recommendation quality.

Implicit ratings are typically collected by the web shop or application in


which the recommender system is embedded. When a customer buys an
item, for instance, recommender systems interpret this behavior as a
positive rating. The system could also monitor the user’s browsing behavior.
If the user retrieves a page with detailed item information and remains at
this page for a longer period of time, a recommender could interpret this
behavior as a positive orientation toward the item.

Although implicit ratings can be collected constantly and do not require


additional efforts from the side of the user, one cannot be sure whether the
user behavior is correctly interpreted.
Data sparsity and the cold-start problem

In the rating matrices used in the previous examples, ratings existed for all
but one user–item combination. In real-world applications, the rating
matrices tend to be very sparse, as customers provide ratings for only a
small fraction of the catalog items.

To compute good predictions when there are relatively few ratings available,
one method is to exploit additional information about the users, such as
gender, age, education, interests, or other available information that can help
to classify the user. The set of similar users (neighbors) is then based not only
on the analysis of the explicit and implicit ratings, but also on information
external to the ratings matrix.

A graph-based method is one approach to dealing with the cold-start and data
sparsity problems. The main idea of the graph-based method is to exploit the
supposed “transitivity” of customer tastes and thereby augment the matrix
with additional information.
Consider the user-item relationship graph which can be inferred from the
binary ratings matrix table.

A 0 in this matrix should not be interpreted as an explicit (poor) rating, but


rather as a missing rating.
Assume that we are looking for a recommendation for User1. When using a
standard CF approach, User2 will be considered a peer for User1 because
they both bought Item2 and Item4. Thus Item3 will be recommended to
User1 because the nearest neighbor, User2, also bought or liked it. This
approach views the recommendation problem as a graph analysis problem,
in which recommendations are determined by determining paths between
users and items.

In a standard user-based or item-based CF approach, paths of length 3 will
be considered. Item3 is relevant for User1 because there exists a three-step
path (User1–Item2–User2–Item3) between them.
Because the number of such paths of length 3 is small in sparse rating
databases, the idea is to consider longer paths (indirect associations) to
compute recommendations. Using path length 5, for instance, would allow
for the recommendation of Item1, as two five-step paths exist that connect
User1 and Item1.
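
The path-based idea can be sketched with a small bipartite user–item graph and a breadth-first search that records which items become reachable within a given path length. Because the binary ratings table itself is not reproduced here, the purchase data below is an assumed example constructed to match the described behavior (Item3 reachable in three steps, Item1 only in five).

```python
# Assumed binary purchase data standing in for the missing matrix:
# a listed item means the user bought/liked it; absence means "no rating yet".
purchases = {
    "User1": {"Item2", "Item4"},
    "User2": {"Item2", "Item3", "Item4"},
    "User3": {"Item1", "Item3"},
}

def recommend_by_paths(user, max_len):
    """Recommend items reachable from `user` by a path of at most max_len edges
    in the bipartite user-item graph (item nodes sit at odd depths: 1, 3, 5, ...)."""
    items = {i for bought in purchases.values() for i in bought}
    # Adjacency lists for both node kinds.
    neighbors = {u: set(bought) for u, bought in purchases.items()}
    for i in items:
        neighbors[i] = {u for u, bought in purchases.items() if i in bought}

    frontier, seen, hops = {user}, {user}, 0
    reachable_items = set()
    while frontier and hops < max_len:
        frontier = {n for node in frontier for n in neighbors[node]} - seen
        seen |= frontier
        hops += 1
        reachable_items |= frontier & items
    return reachable_items - purchases[user]

print(recommend_by_paths("User1", 3))   # {'Item3'}  (three-step path via User2)
print(recommend_by_paths("User1", 5))   # {'Item1', 'Item3'}  (Item1 needs a five-step path)
```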
The cold-start problem can be viewed as a special case of this sparsity
problem. The questions here are
(a) how to make recommendations to new users that have not rated any
item yet
(b) how to deal with items that have not been rated or bought yet.
Both problems can be addressed with the help of hybrid approaches – that
is, with the help of additional, external information.
3.8 Content-based Recommendation

A recommender system can recommend items based on their content only if two pieces of
information are available: a description of the item characteristics and a user
profile that somehow describes the (past) interests of a user, maybe in terms
of preferred item characteristics.
The recommendation task then consists of determining the items that match
the user’s preferences best. This process is called content-based
recommendation.

Although such an approach must rely on additional information about items


and user preferences, it does not require the existence of a large user
community or a rating history – that is, recommendation lists can be
generated even if there is only one single user. The challenging task in this
approach is the acquisition of subjective, qualitative features.

In most of these approaches the basic assumption is that the characteristics


of the items can be automatically extracted from the document content itself
or from unstructured textual descriptions.

Typical examples of content-based recommenders are systems that
recommend news articles by comparing the main keywords of an article in
question with the keywords that appeared in other articles that the user has
rated highly in the past. The recommendable items are referred to as
“documents”.
Content representation and content similarity

The simplest way to describe catalog items is to maintain an explicit list of


features for each item (also often called attributes, characteristics, or item
profiles).
For a book recommender, one could, for instance, use the genre, the
author’s name, the publisher, or anything else that describes the item and
store this information in a relational database system. When the user’s
preferences are described in terms of his or her interests using exactly this
set of features, the recommendation task consists of matching item
characteristics and user preferences.

Consider the Table in which books are described by characteristics such as


title, genre, author, type, price, or keywords.

A book recommender system can construct Alice’s profile in different ways.


The straightforward way is to explicitly ask Alice, for instance, for a desired
price range or a set of preferred genres. The other way is to ask Alice to
rate a set of items, either as a whole or along different dimensions.
In the example, the set of preferred genres could be defined manually by
Alice, whereas the system may automatically derive a set of keywords from
those books that Alice liked, along with their average price.
The set of keywords corresponds to the set of all terms that appear in the
document.

To make recommendations, content-based systems work by evaluating how


strongly a not-yet-seen item is “similar” to items the active user has liked in
the past.
Given an unseen book B, the system could check whether the genre of the
book at hand is in the list of Alice’s preferred genres. Similarity in this case is
either 0 or 1.
Another option is to calculate the similarity or overlap of the involved
keywords. A similarity metric that is suitable for multivalued characteristics is
the Dice coefficient, defined as follows: if every book bi is described by a set
of keywords keywords(bi), the Dice coefficient measures the similarity
between books bi and bj as

dice(bi, bj) = 2 × |keywords(bi) ∩ keywords(bj)| / (|keywords(bi)| + |keywords(bj)|)
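
A minimal sketch of the Dice coefficient on keyword sets as defined above; the two book keyword sets are assumptions.

```python
def dice_similarity(keywords_a, keywords_b):
    """Dice coefficient: twice the keyword overlap divided by the total keyword count."""
    a, b = set(keywords_a), set(keywords_b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

book_1 = {"detective", "murder", "new york"}
book_2 = {"detective", "murder", "chicago", "mafia"}
print(dice_similarity(book_1, book_2))   # 2*2 / (3+4) ≈ 0.57
```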
The vector space model and TF-IDF

The information about the publisher and the author is actually not the
“content” but rather additional knowledge about it. The standard approach in
content-based recommendation is not to maintain a list of “meta-
information” features, but to use a list of relevant keywords that appear
within the document.

The main idea is that such a list can be generated automatically from the
document content itself or from a free-text description. The content of a
document can be encoded in a keyword list in different ways. In a first step,
one could set up a list of all words that appear in all documents and describe
each document by a Boolean vector, where a 1 indicates that a word appears
in a document and a 0 that the word does not appear.

If the user profile is described by a similar list (1 denoting interest in a


keyword), document matching can be done by measuring the overlap of
interest and document content.

TF-IDF is a technique in the field of information retrieval and stands for term
frequency-inverse document frequency. Text documents can be TF-IDF
encoded as vectors in a multidimensional Euclidian space. The space
dimensions correspond to the keywords (also called terms or tokens)
appearing in the documents. The coordinates of a given document in each
dimension (i.e., for each term) are calculated as a product of two
submeasures: term frequency and inverse document frequency.
TF (Term Frequency)

Term frequency describes how often a certain term appears in a document


(assuming that important words appear more often).
To compute the normalized term frequency value TF(i, j) of keyword i in
document j:
Let freq(i, j) be the absolute number of occurrences of i in j.
Given a keyword i, let OtherKeywords(i, j) denote the set of the other keywords
appearing in j.
Compute the maximum frequency maxOthers(i, j) as max(freq(z, j)), z ∈
OtherKeywords(i, j).
Finally, calculate TF(i, j) as

TF(i, j) = freq(i, j) / maxOthers(i, j)

IDF (Inverse Document Frequency)

Inverse document frequency is the measure that is combined with term


frequency. It aims at reducing the weight of keywords that appear very often
in all documents.
The idea is that those generally frequent words are not very helpful to
discriminate among documents, and more weight should therefore be given
to words that appear in only a few documents.

Let N be the number of all recommendable documents and n(i) be the
number of documents from N in which keyword i appears.
The inverse document frequency for i is calculated as

IDF(i) = log(N / n(i))

The combined TF-IDF weight for a keyword i in document j is computed as
the product of these two measures:

TF-IDF(i, j) = TF(i, j) × IDF(i)

In the TF-IDF model, the document is, therefore, represented not as a


vector of Boolean values for each keyword but as a vector of the computed
TF-IDF weights.
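
A small sketch of the TF-IDF encoding exactly as defined above: the term frequency is normalized by the most frequent other keyword in the same document, and the IDF is the logarithm of N/n(i). The toy documents, function names, and the use of the natural logarithm are assumptions.

```python
import math
from collections import Counter

documents = {
    "doc1": "the cat sat on the mat".split(),
    "doc2": "the dog chased the cat".split(),
    "doc3": "the dogs and cats make good pets".split(),
}

def tf(term, doc_terms):
    """Normalized term frequency: freq(i, j) / max frequency of the other keywords in j."""
    counts = Counter(doc_terms)
    others = [c for t, c in counts.items() if t != term]
    max_others = max(others) if others else 1
    return counts[term] / max_others

def idf(term):
    """Inverse document frequency: log(N / n(i))."""
    n_docs = len(documents)
    n_containing = sum(1 for terms in documents.values() if term in terms)
    return math.log(n_docs / n_containing) if n_containing else 0.0

def tfidf(term, doc_id):
    return tf(term, documents[doc_id]) * idf(term)

print(round(tfidf("cat", "doc1"), 3))   # "cat" is fairly specific, so its weight is > 0
print(round(tfidf("the", "doc1"), 3))   # "the" appears in every document, so its weight is 0
```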

Improving the vector space model/limitations

TF-IDF vectors are large and very sparse. To make them more compact and
to remove irrelevant information from the vector, additional techniques can
be applied.

1.Stop words and stemming:


Stop words are prepositions and articles such as “a”, “the”, or “on”,
which can be removed from the document vectors.
Stemming, or conflation, replaces variants of the same word by their
common stem (root word). The word “stemming” would, for instance, be
replaced by “stem”, “went” by “go”, and so on.
These techniques further reduce the vector size and help to improve the
matching process in cases in which the word stems are also used in the user
profile. Stemming procedures are implemented as a combination of
morphological analysis and suffix-stripping algorithms. However, stemming can
also cause errors: both the terms university and universal are stemmed to
univers, which may lead to an unintended match of a document with the user
profile.

2.Size cutoffs:
Another method to reduce the size of the document representation and
remove “noise” from the data is to use only the n most informative words. If
too few keywords (say, fewer than 50) were selected, some important
document features were not covered. On the other hand, when including too
many features (e.g., more than 300), keywords are used in the document
model that have limited importance and therefore represent noise that
actually worsens the recommendation accuracy.

3.Phrases:
A possible improvement to representation accuracy is to use “phrases as terms”,
which are more descriptive of a text than single words alone.

Limitations:
The approach of extracting and weighting individual keywords from the text
has an important limitation: it does not take into account the context of the
keyword and, in some cases, may not capture the “meaning” of the
description correctly.
3.9 Knowledge-based Recommendation

In Knowledge-based recommender systems, the recommendations are


calculated independently of individual user ratings: either in the form of
similarities between customer requirements and items or on the basis of
explicit recommendation rules.
There are two basic types of knowledge-based recommender systems:
constraint-based systems and
case-based systems.
Both approaches are similar in terms of the recommendation process:
the user must specify the requirements, and the system tries to identify a
solution.
If no solution can be found, the user must change the requirements.
The system may also provide explanations for the recommended items.
Constraint-based recommenders rely on an explicitly defined set of
recommendation rules. In constraint-based systems, the set of
recommended items is determined by searching for a set of items that fulfill
the recommendation rules.
Case-based recommenders focus on the retrieval of similar items on the
basis of different types of similarity measures. Case-based systems, use
similarity metrics to retrieve items that are similar (within a predefined
threshold) to the specified customer requirements.

Knowledge representation and reasoning

Knowledge-based systems rely on detailed knowledge about item


characteristics.
A snapshot of an item catalog is shown below for the digital camera domain.

The recommendation problem consists of selecting items from this catalog


that match the user’s needs, preferences, or other requirements. The user’s
requirements can be expressed in terms of desired values or value ranges
for an item feature, such as “the price should be lower than 300 €”, or in
terms of desired functionality, such as “the camera should be suited for
sports photography”.
A constraint-based recommendation problem can be represented as a
constraint satisfaction problem that can be solved by a constraint solver, or in
the form of a conjunctive query that is executed and solved by a database
engine.
Case-based recommendation systems exploit similarity metrics for the
retrieval of items from a catalog.
Constraints

A constraint satisfaction problem (CSP) can be described by a tuple (V,D,C)


where V is a set of variables,
D is a set of finite domains for these variables, and
C is a set of constraints that describes the combinations of values the
variables can simultaneously take.
A solution to a CSP corresponds to an assignment of a value to each variable
in V in a way that all constraints are satisfied.
Example: Constraint-based recommender systems exploit a recommender
knowledge base that includes two different sets of variables (V = VC ∪
VPROD), one describing potential customer requirements and the other
describing product properties.
Three different sets of constraints (C = CR ∪ CF ∪ CPROD) define which items
should be recommended to a customer in which situation.
Examples of such variables and constraints for a digital camera
recommender are shown in the table below.

Customer properties (VC) describe the possible customer requirements. The
customer property max-price denotes the maximum price acceptable for the
customer, the property usage denotes the planned usage of photos (print
versus digital organization), and photography denotes the predominant type
of photos to be taken; categories are, for example, sports or portrait photos.
Product properties (VPROD) describe the properties of products in an
assortment; for example, mpix denotes the possible resolutions of a digital camera.
Compatibility constraints (CR) define allowed instantiations of customer
properties – for example, if large-size photoprints are required, the maximal
accepted price must be higher than 200.
Filter conditions (CF) define under which conditions which products should be
selected – in other words, filter conditions define the relationships between
customer properties and product properties. An example filter condition is
large-size photoprints require resolutions greater than 5 mpix.
Product constraints (CPROD) define the currently available product
assortment. Each conjunction in this constraint completely defines a product
(item) – all product properties have a defined value.

The task of identifying a set of products matching a customer’s wishes and


needs is denoted as a recommendation task.
The customer requirements REQ can be encoded as unary constraints over
the variables in VC and VPROD, for example, max-price = 300. Each solution
to the CSP (V = VC ∪ VPROD,D, C = CR ∪ CF ∪ CPROD ∪ REQ) corresponds to
a consistent recommendation. In many practical settings, the variables in VC
do not have to be instantiated, as the relevant variables are already bound
to values through the constraints in REQ. The task of finding such valid
instantiations for a given constraint problem can be accomplished by
a standard constraint solver.
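The sketch below illustrates this idea with a brute-force check over a tiny catalog instead of a real constraint solver; the catalog entries, requirements, and filter conditions are invented examples in the spirit of the camera domain above.

# Hypothetical product assortment (CPROD) and customer requirements (REQ)
catalog = [
    {"id": "p1", "price": 148, "mpix": 8.0},
    {"id": "p2", "price": 182, "mpix": 8.0},
    {"id": "p4", "price": 278, "mpix": 10.0},
]
requirements = {"max-price": 300, "usage": "large-print"}

def satisfies_filters(req, product):
    # filter conditions (CF): customer requirements imply product constraints
    if req.get("usage") == "large-print" and not product["mpix"] > 5.0:
        return False
    if "max-price" in req and not product["price"] <= req["max-price"]:
        return False
    return True

recommendations = [p["id"] for p in catalog if satisfies_filters(requirements, p)]
print(recommendations)  # every item above happens to satisfy both active constraints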
Conjunctive queries

Another way of constraint-based item retrieval for a given catalog is to view


the item selection problem as a data filtering task. The main task is not to
find valid variable instantiations for a CSP but rather to construct a
conjunctive database query that is executed against the item catalog.
A conjunctive query is a database query with a set of selection criteria that
are connected conjunctively. For example, σ[mpix≥10,price< 300] is a
conjunctive query on the database table P, where σ represents the selection
operator and [mpix ≥ 10, price < 300] is the corresponding selection criteria.
If we exploit conjunctive queries (database queries) for item selection
purposes, VPROD and CPROD are represented by a database table P. Table
attributes represent the elements of VPROD and the table entries represent
the constraint(s) in CPROD.

In this example, the set of available items is P = {p1, p2, p3, p4, p5, p6, p7, p8}.
Queries can be defined that select different item subsets from P depending
on the requirements in REQ. Such queries are directly derived from the filter
conditions (CF) that define the relationship between customer requirements
and the corresponding item properties.
For example, the filter condition usage = large-print → mpix > 5.0 denotes
the fact that if customers want to have large photoprints, the resolution of
the corresponding camera (mpix) must be > 5.0. If a customer defines the
requirement usage = large-print, the corresponding filter condition is active,
and the consequent part of the condition will be integrated in a
corresponding conjunctive query.
The existence of a recommendation for a given set REQ and a product
assortment P is checked by querying P with the derived conditions
(consequents of filter conditions). Such queries are defined in terms of
selections on P formulated as σ[criteria](P), for example, σ[mpix≥10](P) =
{p4, p7}.
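A conjunctive query such as σ[mpix ≥ 10, price < 300](P) can be mimicked in plain Python, as sketched below (or expressed in SQL as WHERE mpix >= 10 AND price < 300); the table contents are illustrative stand-ins, not the actual catalog from the text.

# Item table P with illustrative values
P = [
    {"id": "p4", "mpix": 10.0, "price": 278},
    {"id": "p7", "mpix": 12.0, "price": 299},
    {"id": "p8", "mpix": 9.1,  "price": 319},
]

def select(criteria, table):
    # sigma[criteria](table): keep the rows on which every predicate holds
    return [row for row in table if all(pred(row) for pred in criteria)]

query = [lambda r: r["mpix"] >= 10, lambda r: r["price"] < 300]
print([r["id"] for r in select(query, P)])  # -> ['p4', 'p7']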

Cases and Similarities

In case-based recommendation approaches, items are retrieved using


similarity measures that describe to which extent item properties match
some given user’s requirements. The distance similarity of an item p to the
requirements REQ is defined as the weighted average of the local similarities:
similarity(p, REQ) = Σ (r ∈ REQ) wr × sim(p, r) / Σ (r ∈ REQ) wr
Here sim(p, r) expresses, for each item attribute value φr(p), its distance to the
customer requirement r ∈ REQ – for example, φmpix(p1) = 8.0 – and wr is the
importance weight for requirement r.

In real-world scenarios, there are properties a customer would like to


maximize – for example, the resolution of a digital camera. There are also
properties that customers want to minimize – for example, the price of a
digital camera or the risk level of a financial service. In the first case, the
corresponding properties are called “more-is-better” (MIB) properties; in the
second case they are denoted as “less-is-better” (LIB) properties.
To take these basic properties into account in similarity calculations, the
following formulae for calculating the local similarities are used.
In the case of MIB properties, the local similarity between p and r is calculated
as follows:
sim(p, r) = (φr(p) − min(r)) / (max(r) − min(r))
The local similarity between p and r in the case of LIB properties is calculated as
follows:
sim(p, r) = (max(r) − φr(p)) / (max(r) − min(r))

There are situations in which the similarity should be based solely on the
distance to the originally defined requirements. For example, if the user has
a certain run time of a financial service in mind or requires a certain monitor
size, the shortest run time as well as the largest monitor will not represent
an optimal solution. For such cases, a third type of local similarity function is
given by
sim(p, r) = 1 − |φr(p) − r| / (max(r) − min(r))
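A compact sketch of the weighted similarity computation under these MIB/LIB conventions follows; the attribute ranges, weights, and the sample item are assumptions chosen for illustration.

def local_sim(value, req, lo, hi, kind):
    # kind: "MIB" (more is better), "LIB" (less is better), or "EXACT"
    if kind == "MIB":
        return (value - lo) / (hi - lo)
    if kind == "LIB":
        return (hi - value) / (hi - lo)
    return 1 - abs(value - req) / (hi - lo)  # based on distance to the stated requirement

def similarity(item, requirements):
    # weighted average of the local similarities over all requirements
    num = sum(w * local_sim(item[attr], req, lo, hi, kind)
              for attr, (req, w, lo, hi, kind) in requirements.items())
    return num / sum(w for (_, w, _, _, _) in requirements.values())

camera = {"mpix": 8.0, "price": 278}
reqs = {"mpix":  (5.0, 2, 3.0, 12.0, "MIB"),   # (requirement, weight, min, max, type)
        "price": (300, 1, 100, 500, "LIB")}
print(round(similarity(camera, reqs), 3))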
3.10 Hybrid recommendation approaches

A recommendation system is represented as a black box that transforms


input data into a ranked list of items as output. User models and contextual
information, community data, product features, and knowledge models
constitute the potential types of recommendation input.
Hybrid recommender systems are approaches that combine several
algorithm implementations or recommendation components.

There are two dimensions:


Basic recommendation paradigm
System’s hybridization design - method used to combine two or more
algorithms.
Recommendation paradigms

A recommendation problem can be treated as a utility function rec that


predicts the usefulness of an item i in a set of items I for a specific user u
from the universe of all users U, and is denoted by a function rec: U × I → R. R is
the interval [0 ... 1] representing the possible utility scores of a recommended
item.
The selection task of any recommender system RS is to identify those n items
from a catalog I that achieve the highest utility scores for a given user u:

A recommendation system RS will output a ranked list of the top n items that
are of highest utility for the given user.
The choice of the recommendation paradigm determines the type of input
data that is required. The user model and contextual parameters represent
the user and the specific situation the user is currently in. (address, age, or
education, and contextual parameters such as the season of the year, the
people who will accompany the user when he or she buys or consumes the
item or the current location of the user). All these user and situation-specific
data are stored in the user’s profile. All recommendation paradigms require
access to this user model to personalize the recommendations. However,
depending on the application domain and the usage scenario, only limited
parts of the user model may be available. Recommendation paradigms
selectively require community data, product features, or knowledge models.
Hybridization designs
Three base designs:
Monolithic hybrids
Parallelized hybrids
Pipelined hybrids

Monolithic hybridization design:


Monolithic denotes a hybridization design that incorporates aspects of
several recommendation strategies in one algorithm implementation.
Monolithic hybrids consist of a single recommender component that
integrates multiple approaches by preprocessing and combining several
knowledge sources. Hybridization is achieved by a built-in modification of the
algorithm behavior to exploit different types of input data. Data-specific
preprocessing steps are used to transform the input data into a
representation that can be exploited by a specific algorithm paradigm.
Several recommenders contribute virtually because the hybrid uses
additional input data that are specific to another recommendation algorithm,
or the input data are augmented by one technique and factually exploited by
the other. Both feature combination and feature augmentation strategies can
be assigned to this category.

Feature combination hybrids

A feature combination hybrid is a monolithic recommendation component


that uses a diverse range of input data. A feature combination hybrid
combines collaborative features, such as a user’s likes and dislikes, with
content features of catalog items.
Example: In book domain, the below table presents several users’ unary
ratings of a product catalog. Ratings constitute implicitly observed user
feedback, such as purchases. Product knowledge is restricted to an item’s
genre. In a pure collaborative approach, without considering any product
features, both User1 and User2 would be judged as being equally similar to
Alice. However, the feature-combination approach identifies new hybrid
features based on community and product data.

A hybrid encoding of the information contained is shown in the below Table.


These features are derived by the following rules: if a user bought mainly
books of genre X (i.e., two-thirds of the total purchases and at least two
books), the user characteristic “User likes many X books” is set to true.
“User likes some X books” is set to true if the genre accounts for one-third of
the user’s purchases, or for at least one item in absolute numbers.
Initially Alice seems to possess similar interests to User1 and User2, but the
picture changes after transforming the user/item matrix. It actually turns
out that User2 behaves in a very indeterminate manner by buying some
books of all three genres. In contrast, User1 seems to focus specifically on
mystery books, as Alice does.
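The rule just described for deriving the hybrid features can be sketched as follows; the small purchase dictionary is invented purely for illustration.

from collections import Counter

# Unary ratings reduced to the genre of each purchased book (illustrative data)
purchases = {"Alice": ["mystery", "mystery"],
             "User1": ["mystery", "mystery", "mystery"],
             "User2": ["mystery", "romance", "science"]}

def hybrid_features(genres):
    counts, total = Counter(genres), len(genres)
    feats = {}
    for g, c in counts.items():
        # "likes many": at least two books and two-thirds of all purchases
        feats[f"likes many {g}"] = c >= 2 and c / total >= 2 / 3
        # "likes some": one-third of the purchases, or at least one item
        feats[f"likes some {g}"] = c >= 1 or c / total >= 1 / 3
    return feats

for user, genres in purchases.items():
    print(user, hybrid_features(genres))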

Another approach for feature combination exploit different types of rating


feedback based on their predictive accuracy and availability. The below Table
depicts a scenario in which several types of implicitly or explicitly collected
user feedback are available as unary ratings, such as navigation actions
Rnav, click-throughs on items’ detail pages Rview, contextual user
requirements Rctx , or actual purchases Rbuy. These categories differentiate
themselves by their availability and their aptness for making predictions. For
instance, navigation actions and page views of online users occur frequently,
whereas purchases are less common. In addition, information about the
user’s context, such as the keywords used for searching or input to a
conversational requirements elicitation dialog, is a source for computing
recommendations.
In contrast, navigation actions typically contain more noise, particularly
when users randomly surf and explore a shop. Therefore, rating categories
may be prioritized – for example (Rbuy, Rctx ) ≺ Rview ≺ Rnav, where
priority decreases from left to right. When interpreting all rating data in
below Table, User2 and User3 seem to have most in common with Alice, as
both share at least three similar ratings.

However, the algorithm for neighborhood determination uses the given
precedence rules and thus initially uses only Rbuy and Rctx as rating inputs
to find similar peers. User1 would then be identified as being most similar to
Alice. Only if the similar peers cannot be determined with satisfactory
confidence does the feature combination algorithm include additional,
lower-ranked rating input.
A similar principle for feature combination uses demographic user
characteristics to bootstrap a collaborative recommender system when not
enough item ratings are known. Because of the simplicity of feature
combination approaches, many current recommender systems combine
collaborative and/or content-based features in one way or another.
Feature augmentation hybrids

Feature augmentation is another monolithic hybridization design that may be


used to integrate several recommendation algorithms. This hybrid does not
simply combine and preprocess several types of input, but rather applies
more complex transformation steps. The output of a contributing
recommender system augments the feature space of the actual
recommender by preprocessing its knowledge sources. Content-boosted
collaborative filtering is an example and it predicts a user’s assumed rating
based on a collaborative mechanism that includes content-based predictions.

The below table presents an example user/item matrix, together with rating
values for Item5 (VUser,Item5), the Pearson correlation coefficient
Palice,User signifying the similarity between Alice and the respective users.
It denotes the number of user ratings (nUser) and the number of
overlapping ratings between Alice and the other users (nAlice,User).

The rating matrix for this example is complete, because it contains not only
the users’ actual ratings ru,i but also content-based predictions cu,i in
cases where a user rating is missing.
The pseudo-user-ratings vector vu,i is created as
vu,i = ru,i if user u rated item i; otherwise vu,i = cu,i

Based on these pseudo ratings, the algorithm computes predictions in a


second step. However, depending on the number of rated items and the
number of co-rated items between two users, weighting factors may be used
to adjust the predicted rating value for a specific user–item pair a, i, as given
by the corresponding prediction equation.

The hybrid correlation weight (hwa,u) adjusts the computed Pearson
correlation based on a significance weighting factor sga,u that favors peers
with more co-rated items. When the number of co-rated items is greater
than 50, the effect tends to level off.
In addition, the harmonic mean weighting factor (hma,u) reduces the
influence of peers with a low number of user ratings, as content-based
pseudo-ratings are less reliable if derived from a small number of user
ratings. The self-weighting factor swi reflects the confidence in the
algorithm’s content-based prediction, which also depends on the number of
original rating values of any user i. The constant max was set to 2.

Thus, assuming that the content-based prediction for Item5 is


cAlice,Item5 = 3, reccbcf (Alice, Item5) is computed as follows:

swa = 40/50 × 2 = 1.6, cu,i = 3
hma,u = 2·ma·mu / (ma + mu) = (2 × 0.8 × 0.28) / (0.28 + 0.8) = 0.448/1.08 = 0.414 for one peer
hma,u = 2·ma·mu / (ma + mu) = (2 × 0.8 × 1) / (0.8 + 1) = 1.6/1.8 = 0.88 for the other peer
hwa,u = sga,u + hma,u = 0.12 + 0.414 = 0.535
hwa,u = sga,u + hma,u = 0.56 + 0.88 = 1.45

The predicted value (on a 1–5 scale) indicates that Alice will not be euphoric
about Item5, although the most similar peer, User1, rated it with a 4.
However, the weighting and adjustment factors place more emphasis on
User 2 and the content-based prediction.
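A simplified sketch of the pseudo-rating construction and the adjustment weights described above is shown below. The cut-off of 50 co-rated items and max = 2 follow the text; everything else, including the sample values, is illustrative.

def pseudo_rating(actual, content_pred):
    # v_u,i: the actual rating if present, otherwise the content-based prediction
    return actual if actual is not None else content_pred

def significance_weight(n_corated, cutoff=50):
    # sg_a,u: favours peers with more co-rated items, levelling off at the cutoff
    return min(n_corated, cutoff) / cutoff

def harmonic_mean_weight(m_a, m_u):
    # hm_a,u: reduces the influence of peers with few original ratings
    return 2 * m_a * m_u / (m_a + m_u)

def self_weight(n_ratings, cutoff=50, max_weight=2):
    # sw_i: confidence in the content-based prediction for user i
    return min(n_ratings / cutoff, 1.0) * max_weight

print(self_weight(40))                             # 1.6, as in the worked example
print(round(harmonic_mean_weight(0.8, 0.28), 2))   # ~0.41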
Parallelized Designs

Parallelized designs require at least two separate recommender


implementations, whose results are subsequently combined. Based on their input,
parallelized hybrid recommender systems operate independently of one
another and produce separate recommendation lists. In the hybridization step,
their output is combined into a final set of recommendations. The weighted,
mixed, and switching strategies require recommendation components to
work in parallel.

Parallelized Designs – Mixed Hybrids


A mixed hybridization strategy combines the results of different
recommender systems at the level of the user interface, in which results
from different techniques are presented together. Therefore the
recommendation result for user u and item i of a mixed hybrid strategy is the
set of tuples ⟨score, k⟩ over each of its n constituting recommenders reck:
recmixed(u, i) = ⋃ (k = 1..n) ⟨reck(u, i), k⟩

The top-scoring items for each recommender are then displayed to the user
next to each other.
Another form of a mixed hybrid merges the results of several
recommendation systems into bundles. It proposes bundles of recommendations
from different product categories in the tourism domain, in which a separate
recommender is employed for each category. A recommended bundle
consists, for instance, of accommodations, as well as sport and leisure
activities, that are derived by separate recommender systems. A Constraint
Satisfaction Problem (CSP) solver is employed to resolve conflicts, thus
ensuring that only consistent sets of items are bundled together according to
domain constraints such as “activities and accommodations must be within
50 km of each other”.
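A minimal sketch of the user-interface-level merge is given below; it simply interleaves the ranked results of the contributing recommenders, and the score lists are placeholders.

def mixed_hybrid(rec_outputs, n=4):
    # rec_outputs: one ranked (item, score) list per recommender k
    merged, seen = [], set()
    for rank in range(max(len(r) for r in rec_outputs)):
        for k, ranked in enumerate(rec_outputs):
            if rank < len(ranked) and ranked[rank][0] not in seen:
                item, score = ranked[rank]
                merged.append((item, score, k))   # tuple (item, score, recommender index)
                seen.add(item)
    return merged[:n]

rec1 = [("Hotel A", 0.9), ("Hotel B", 0.7)]    # e.g. an accommodation recommender
rec2 = [("City tour", 0.8), ("Museum", 0.6)]   # e.g. an activity recommender
print(mixed_hybrid([rec1, rec2]))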

Parallelized Designs – Weighted Hybrids

A weighted hybridization strategy combines the recommendations of two or


more recommendation systems by computing weighted sums of their scores.
Given n different recommendation functions reck with associated relative
weights βk, the weighted hybrid score is
recweighted(u, i) = Σ (k = 1..n) βk × reck(u, i)
where item scores are restricted to the same range for all recommenders and
Σk βk = 1.
This technique combines the predictive power of different recommendation
techniques in a weighted manner.
Consider an example in which two recommender systems are used to
suggest one of five items for a user Alice. These recommendation lists are
hybridized by using a uniform weighting scheme with β1 = β2 = 0.5. The
weighted hybrid recommender recw thus produces a new ranking by combining
the scores from rec1 and rec2, for example:
recw = (0.5 × 0.5) + (0.8 × 0.5) = 0.25 + 0.4 = 0.65
recw = 0 + (0.9 × 0.5) = 0.45
recw = (0.3 × 0.5) + (0.4 × 0.5) = 0.15 + 0.2 = 0.35
Items that are recommended by only one of the two recommenders, such as
Item2, may still be ranked highly after the hybridization step.
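The same combination can be sketched in a few lines of Python; the score dictionaries below are stand-ins chosen so that the totals match the computations above.

def weighted_hybrid(score_lists, betas):
    # rec_weighted(u, i) = sum_k beta_k * rec_k(u, i); items a recommender omits score 0
    items = set().union(*score_lists)
    combined = {i: round(sum(b * s.get(i, 0.0) for s, b in zip(score_lists, betas)), 2)
                for i in items}
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

rec1 = {"Item1": 0.5, "Item3": 0.3}                 # illustrative scores only
rec2 = {"Item1": 0.8, "Item2": 0.9, "Item3": 0.4}
print(weighted_hybrid([rec1, rec2], [0.5, 0.5]))
# -> Item1: 0.65, Item2: 0.45, Item3: 0.35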

If the weighting scheme is to remain static, the involved recommenders


must also produce recommendations of the same relative quality for all user
and item combinations. To estimate weights, an empirical bootstrapping
approach can be taken. A sensitivity analysis is conducted between a
collaborative and a knowledge-based recommender to identify the optimum
weighting scheme. The P-Tango system is another example of such a
system, blending the output of a content-based and a collaborative
recommender in the news domain. This approach uses a dynamic weighting
scheme and it dynamically adjusts the relative weights for each user to
minimize the predictive error in cases in which user ratings are available.
Let us assume that Alice purchased Item1 and Item4, which we will interpret
as positive unary ratings. We thus require a weighting that minimizes a goal
metric such as the mean absolute error (MAE) of predictions of a user for her
rated items R.

With β1 = 0.1 and β2 = 0.9, the absolute errors for Alice’s two rated items are
|0.1 × (0.5 − 1) + 0.9 × (0.8 − 1)| = 0.05 + 0.18 = 0.23
|0.1 × (0.1 − 1) + 0.9 × (0 − 1)| = 0.09 + 0.9 = 0.99
MAE = (0.23 + 0.99)/2 = 1.22/2 = 0.61
If we interpret Alice’s purchases as very strong relevance feedback for her
interest in an item, then the set of Alice’s actual ratings R will contain r1 = r4
= 1.
The table below summarizes the absolute errors of the predictions of rec1 and
rec2 shown in the previous table for different weighting parameters; it shows
that the MAE improves as rec2 is weighted more strongly.

However, from a ranking perspective it is evident that the weight assigned to
rec1 should be strengthened: both rated items are ranked higher by rec1 than
by rec2 – Item1 is first instead of second and Item4 is third, whereas rec2
does not recommend Item4 at all.
When applying a weighted strategy, it must therefore be ensured that the
involved recommenders assign scores on comparable scales, or a
transformation function must be applied beforehand.
The assignment of dynamic weighting parameters stabilizes as more rated
items are made available by users. Other error metrics, such as mean squared
error or rank metrics, can also be explored: mean squared error puts more
emphasis on large errors, whereas rank metrics focus not on
recommendation scores but on ranks.
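A rough sketch of how a dynamic weighting scheme could pick β by minimizing the MAE on a user's known ratings is given below; the prediction values follow the worked example, while the candidate weight grid is an assumption.

def mae(betas, score_lists, actual):
    # mean absolute error of the weighted predictions on the rated items
    errors = [abs(sum(b * s.get(item, 0.0) for s, b in zip(score_lists, betas)) - r)
              for item, r in actual.items()]
    return sum(errors) / len(errors)

rec1 = {"Item1": 0.5, "Item4": 0.1}              # illustrative predictions
rec2 = {"Item1": 0.8, "Item4": 0.0}
alice_ratings = {"Item1": 1.0, "Item4": 1.0}     # purchases treated as unary ratings

candidates = [(b / 10, 1 - b / 10) for b in range(11)]   # (beta1, beta2) grid
best = min(candidates, key=lambda bs: mae(bs, [rec1, rec2], alice_ratings))
print(best, round(mae(best, [rec1, rec2], alice_ratings), 2))
# the error metric alone favours weighting rec2 more strongly, as noted above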

Switching hybrids

Switching hybrids decide which recommender should be used in a specific
situation, depending on the user profile and/or the quality of the
recommendation results. The evaluation could be carried out as follows:
recswitching(u, i) = reck(u, i)
where k is determined by the switching condition.


To overcome the cold-start problem, a knowledge-based and collaborative
switching hybrid could initially make knowledge-based recommendations
until enough rating data are available. When the collaborative filtering
component can deliver recommendations with sufficient confidence, the
recommendation strategy could be switched. A switching strategy is
developed that actually switches between two hybrid variants of
collaborative filtering and knowledge-based recommendation. If the first
algorithm, a cascade hybrid, delivers fewer than n recommendations, the
hybridization component switches to a weighted variant as a fallback
strategy.
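A minimal sketch of such a switching condition, falling back to a knowledge-based recommender while rating data are scarce, is shown below; the threshold and the recommender stubs are placeholders.

MIN_RATINGS = 20   # assumed confidence threshold

def switching_recommend(user, ratings_count, collaborative_rec, knowledge_rec):
    # rec_switching(u, i) = rec_k(u, i), where k is chosen by the switching condition
    if ratings_count(user) >= MIN_RATINGS:
        return collaborative_rec(user)
    return knowledge_rec(user)   # cold-start fallback

print(switching_recommend("Alice",
                          ratings_count=lambda u: 3,
                          collaborative_rec=lambda u: ["ItemA"],
                          knowledge_rec=lambda u: ["ItemB"]))   # -> ['ItemB']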
Pipelined hybridization design

When several recommender systems are joined together in a pipeline


architecture, the output of one recommender becomes part of the input of
the subsequent one. Pipelined hybrids implement a staged process in which
several techniques sequentially build on each other before the final one
produces recommendations for the user. The pipelined hybrid variants
differentiate themselves mainly according to the type of output they produce
for the next stage. In other words, a preceding component may either
preprocess input data to build a model that is exploited by the subsequent
stage or deliver a recommendation list for further refinement.
Cascade and meta-level hybrids are examples of such pipeline
architectures.

Cascade hybrids

Cascade hybrids are based on a sequenced order of techniques, in which


each succeeding recommender only refines the recommendations of its
predecessor. The recommendation list of the successor technique is thus
restricted to items that were also recommended by the preceding technique.
Assume a sequence of n techniques, where rec1 represents the
recommendation function of the first technique and recn the last one. The
final recommendation score for an item is computed by the nth technique.
However, an item will be suggested by the kth technique only if the (k −
1)th technique also assigned a nonzero score to it. This applies to all k ≥ 2
by induction.

In a cascade hybrid, all techniques except the first one can only change the
ordering of the list of recommended items from their predecessor or exclude
an item by setting its utility to 0. Cascading strategies therefore tend to
reduce the size of the recommendation set as each additional technique is
applied. As a consequence, cascade algorithms may not deliver the required
number of propositions, thus decreasing the system’s usefulness. Therefore
cascade hybrids may be combined with a switching strategy to handle the
case in which the cascade strategy does not produce enough recommendations.
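A short sketch of the cascading principle follows: each stage may only re-score or drop items that survived the previous stage. The two stage functions are invented placeholders.

def cascade(stages, user, items):
    # stage k only sees items given a nonzero score by stage k-1
    scores = {i: 1.0 for i in items}   # every catalog item starts as a candidate
    for stage in stages:
        scores = {i: stage(user, i) for i, s in scores.items() if s > 0}
    return sorted((i for i, s in scores.items() if s > 0),
                  key=lambda i: scores[i], reverse=True)

# Illustrative stages: a coarse knowledge-based filter, then a fine-grained scorer
stage1 = lambda u, i: 1.0 if i != "ItemC" else 0.0
stage2 = lambda u, i: {"ItemA": 0.9, "ItemB": 0.4}.get(i, 0.1)
print(cascade([stage1, stage2], "Alice", ["ItemA", "ItemB", "ItemC"]))  # -> ['ItemA', 'ItemB']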

Meta-level hybrids

In meta-level hybridization design, one recommender builds a model that is


exploited by the principal recommender to make recommendations. The nth
recommender exploits a model that has been built by its predecessor; in
existing systems, n has always been 2.
A meta-level hybridization combines collaborative filtering with knowledge-
based recommendation. The hybrid generates binary user preferences of the
form a → b, where a represents a user requirement and b a product feature.
For example, if a user of the cigar advisor is looking for a gift and finally
buys a cigar of the brand Montecristo, then the constraint for whom =
“gift” → brand = “Montecristo” becomes part of the user’s profile. When
computing a recommendation, a collaborative filtering step retrieves all such
constraints from a user’s peers.
A knowledge-based recommender finally applies these restrictions to the
product database and derives item propositions. An evaluation showed that
this approach produced more successful predictions than a manually crafted
knowledge base or an impersonalized application of all generated
constraints.
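A rough sketch of this collaborative-then-knowledge-based pipeline is given below; the peer constraints and the product table are invented for illustration.

# Constraints learned from peers: (user requirement) -> (required product feature)
peer_constraints = {("for_whom", "gift"): ("brand", "Montecristo"),
                    ("budget", "low"):    ("price_class", "economy")}

products = [{"name": "Cigar 1", "brand": "Montecristo", "price_class": "premium"},
            {"name": "Cigar 2", "brand": "Cohiba",      "price_class": "economy"}]

def meta_level_recommend(user_requirements, constraints, catalog):
    # collaborative step: collect the constraints triggered by the user's requirements
    active = [feat for req, feat in constraints.items() if req in user_requirements]
    # knowledge-based step: apply those constraints as filters on the product database
    return [p["name"] for p in catalog
            if all(p.get(attr) == value for attr, value in active)]

print(meta_level_recommend({("for_whom", "gift")}, peer_constraints, products))  # -> ['Cigar 1']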

Lecture Notes
https://drive.google.com/file/d/1RfzP6WM_ZTlXN1SEwgO6AEWdNlrvsAOb/vi
ew?usp=sharing

Assignments
Q.No. 1 (CO3, K3)
a) Find the L1 and L2 distances between the points (5, 6, 7) and (8, 2, 4).
b) Find the Jaccard distance between the sets {1, 2, 3, 4} and {2, 3, 4, 5}.
c) Compute the cosine of the angle between the vectors (3, −1, 2) and (−2, 3, 1).
d) Find the edit distance (using only insertions and deletions) between the strings abcdef and bdaefc.
e) Find the Hamming distance between the vectors 000000 and 110011.

Q.No. 2 (CO3, K4)
A database has five transactions. Let min sup = 60% and min conf = 80%.
TID     ITEMS
T100    Milk, Onion, Nuts, Kiwi, Egg, Yoghurt
T200    Dhal, Onion, Nuts, Kiwi, Egg, Yoghurt
T300    Milk, Apple, Kiwi, Egg
T400    Milk, Curd, Kiwi, Yoghurt
T500    Curd, Onion, Kiwi, Ice cream, Egg
Find all frequent item sets using the Apriori algorithm.
Part-A Questions and Answers
1. Define Association rule. (CO3, K2)
Association rule is an unsupervised and descriptive learning method used to
discover interesting relationships hidden in a large dataset. The disclosed
relationships can be represented as rules or frequent itemsets. Association
rules are commonly used for mining transactions in databases. Each of the
uncovered rules is in the form X → Y, meaning that when item X is observed,
item Y is also observed.
2. What is meant by Market basket analysis? (CO3, K2)
Association rules are sometimes referred to as market basket analysis. Each
transaction can be viewed as the shopping basket of a customer that
contains one or more items. This is also known as an itemset.

3. Define Itemset. (CO3, K2)


The term itemset refers to a collection of items or individual entities that
contain some kind of relationship. This could be a set of retail items
purchased together in one transaction, a set of hyperlinks clicked on by one
user in a single session, or a set of tasks done in one day. An itemset
containing k items is called a k-itemset. K-itemset can be denoted as {item
1,item 2,. . . item k.}.

4. Define Support. (CO3, K2)


One major component of Apriori is support. Given an itemset L, the support
of L is the percentage of transactions that contain L. For example, if 80% of
all transactions contain itemset {bread}, then the support of {bread} is 0.8.
5. What is meant by frequent itemset? (CO3, K2)
A frequent itemset has items that appear together often enough. The term
“often enough” is formally defined with a minimum support criterion. If the
minimum support is set at 0.5, any itemset can be considered a frequent
itemset if at least 50% of the transactions contain this itemset. The support
of a frequent itemset should be greater than or equal to the minimum
support.

6. Define downward closure property. (CO3, K2)


If an itemset is considered frequent, then any subset of the frequent itemset
must also be frequent. This is referred to as the Apriori property (or
downward closure property). For example, if 60% of the transactions contain
{bread,jam}, then at least 60% of all the transactions will contain {bread} or
{jam}.

7. Define Apriori algorithm. (CO3, K2)


The Apriori algorithm takes a bottom-up iterative approach to uncovering the
frequent itemsets by first determining all the possible items (or 1-itemsets,
for example {bread}, {eggs}, {milk}, …) and then identifying which among
them are frequent.

8. What is meant by pruning in Association rule? (CO3, K2)


Assuming the minimum support threshold (or the minimum support criterion)
is set at 0.5, the algorithm identifies and retains those itemsets that appear
in at least 50% of all transactions and discards (or “prunes away”) the
itemsets that have a support less than 0.5 or appear in fewer than 50% of
the transactions.
9. How will you evaluate the candidate rules? (CO3, K3)
The measures such as confidence, lift, and leverage can help evaluate the
appropriateness of these candidate rules.

10. Define Confidence. (CO3, K2)


Confidence is defined as the measure of certainty or trustworthiness
associated with each discovered rule. Mathematically, confidence is the
percent of transactions that contain both X and Y out of all the transactions
that contain X.

11. Define Lift. (CO3, K2)


Lift measures how many times more often X and Y occur together than
expected if they are statistically independent of each other. Lift is a measure
of how X and Y are really related rather than coincidentally happening
together.

Lift is 1 if X and Y are statistically independent of each other. In contrast, a


lift of X → Y greater than 1 indicates that there is some usefulness to the
rule. A larger value of lift suggests a greater strength of the association
between X and Y.
12. Define Leverage. (CO3, K2)
Leverage measures the difference in the probability of X and Y appearing
together in the dataset compared to what would be expected if X and Y were
statistically independent of each other.

Leverage is 0 when X and Y are statistically independent of each other. If X


and Y have some kind of relationship, the leverage would be greater than zero.
A larger leverage value indicates a stronger relationship between X and Y.

13. Mention the applications of Association Rules. (CO3, K2)


Various applications are
Broad-scale approaches to better merchandising
Cross-merchandising between products and high-margin or high-ticket items
Physical or logical placement of product within related categories of products
Promotional programs—multiple product purchase incentives managed through
a loyalty card program

14. Define Recommender System. (CO3, K2)


Many online service providers such as Amazon and Netflix use recommender
systems. Recommender systems can use association rules to discover related
products or identify customers who have similar interests.

15. Define Clickstream analysis. (CO3, K2)


Clickstream analysis refers to the analytics on data related to web browsing
and user clicks. This analytics provides valuable insight on how to better
personalize and recommend the content to site visitors.
16. What is the name of the function in R support Apriori
algorithm? (CO3, K2)
The apriori() function from the arules package implements the Apriori
algorithm to create frequent itemsets in R programming.

17. Define Distance measure. (CO3, K2)


Suppose we have a set of points, called a space. A distance measure on this
space is a function d(x, y) that takes two points in the space as arguments
and produces a real number, and satisfies the following axioms:
d(x, y) ≥ 0 (no negative distances).
d(x, y) = 0 if and only if x = y (distances are positive, except for the
distance from a point to itself).
d(x, y) = d(y, x) (distance is symmetric).
d(x, y) ≤ d(x, z) + d(z, y) (the triangle inequality).

18. Define Euclidean distance. (CO3, K2)


The most common notion of distance is the Euclidean distance in an n-
dimensional space. This distance, sometimes called the L2-norm, is the
square root of the sum of the squares of the differences between the points
in each dimension.

Another distance suitable for Euclidean spaces, called Manhattan distance or


the L1-norm is the sum of the magnitudes of the differences between the
points in each dimension.
19. Define cosine similarity. (CO3, K2)
The similarity between two items a and b is computed on their corresponding
rating vectors and is formally defined as
sim(a, b) = (a · b) / (|a| × |b|)
The · symbol is the dot product of vectors. |a| is the Euclidean length of the
vector, which is defined as the square root of the dot product of the vector
with itself.

20. Define Jaccard similarity. (CO3, K2)


The Jaccard similarity of sets is the ratio of the size of the intersection of the
sets to the size of the union. This measure of similarity is suitable for many
applications, including textual similarity of documents and similarity of buying
habits of customers.
The Jaccard similarity of sets S and T is | S ∩ T | / |S ∪ T |
Example
Consider two sets S and T. There are three elements in their intersection and a
total of eight elements that appear in S or T or both. Thus, SIM(S, T) = 3/8.

21. Define Jaccard distance. (CO3, K2)


One minus the Jaccard similarity is a distance measure, called the Jaccard
distance.
d(x, y) = 1 − SIM(x, y)
22. Define Cosine Distance. (CO3, K2)
The angle between vectors in a vector space is the cosine distance measure.
We can compute the cosine of that angle by taking the dot product of the
vectors and dividing by the lengths of the vectors.

23. Define Edit distance. (CO3, K2)


This distance measure applies to a space of strings, and is the number of
insertions and/or deletions needed to convert one string into the other. The
edit distance can also be computed as the sum of the lengths of the strings
minus twice the length of the longest common subsequence of the strings.

24. Define Hamming Distance. (CO3, K2)


This distance measure applies to a space of vectors. The Hamming distance
between two vectors is the number of positions in which the vectors differ.

25. Consider the two-dimensional Euclidean space and the points


(2, 7) and (6, 4). (CO3, K3)
The L2-norm gives a distance of √((2 − 6)² + (7 − 4)²) = √(16 + 9) = 5.
The L1-norm gives a distance of |2 − 6| + |7 − 4| = 4 + 3 = 7.
The L∞-norm gives a distance of max (|2 − 6|, |7 − 4|) = max(4, 3) = 4
26. Cosine similarity. (CO3, K3)
Let the two vectors be x = [1, 2, −1] and y = [2, 1, 1].
The dot product x · y is 1 × 2 + 2 × 1 + (−1) × 1 = 3.
Both x and y have an L2-norm of √(1² + 2² + 1²) = √6.
Thus, the cosine of the angle between x and y is 3/(√6 × √6) = 1/2. The angle
whose cosine is 1/2 is 60 degrees, so that is the cosine distance between x
and y.

27. Edit distance. (CO3, K3)


The edit distance between the strings x = abcde and y = acfdeg is 3.
To convert x to y:
1. Delete b.
2. Insert f after c.
3. Insert g after e.
No sequence of fewer than three insertions and/or deletions will convert x to y.
Thus, d(x, y) = 3.

28. Edit Distance (CO3, K3)


The strings x = abcde and y = acfdeg have a unique LCS, which is acde.
The length of x is 5, the length of y is 6, and the length of their LCS is 4. The
edit distance is thus 5 + 6 − 2 × 4 = 3

29. Hamming Distance (CO3, K3)


The Hamming distance between the vectors 10101 and 11110 is 3. That is,
these vectors differ in the second, fourth, and fifth components, while they
agree in the first and third components.
30. Define Collaborative recommendation. (CO3, K2)
Collaborative recommendation systems focus on the relationship between
users and items. Similarity of items is determined by the similarity of the
ratings of those items by the users who have rated both items.

31. Define Content based recommendation. (CO3, K2)


Content-Based systems focus on properties of items. Similarity of items is
determined by measuring the similarity in their properties.

32. What is user based nearest neighbor recommendation? (CO3,


K2)
Given a ratings database and the ID of the current (active) user as an input,
identify other users (sometimes referred to as peer users or nearest
neighbors) that had similar preferences to those of the active user in the
past. Then, for every product p that the active user has not yet seen, a
prediction is computed based on the ratings for p made by the peer users.

33. What is item based nearest neighbor recommendation? (CO3,


K2)
The main idea of item-based algorithms is to compute predictions using the
similarity between items and not the similarity between users.

34. What are the applications of recommendation system? (CO3,


K2)
Product Recommendations
Movie Recommendations
News Articles
35. Define knowledge based recommendation system. (CO3, K2)
In knowledge based recommendation system, users are offered some items
based on knowledge about the users, items and their relationships.
Constraint and case based are the two types in knowledge based
recommendation system.

36. Define hybrid recommendation system. (CO3, K2)


It combines the best features of two or more recommendation techniques into
one to achieve higher performance and overcome the drawbacks of traditional
recommendation techniques. Netflix is a well-known example of a hybrid
recommendation system.
Part-B Questions
1. Explain in detail about Association rule mining. (CO3, K3)
2. Explain in detail about the Apriori algorithm for mining frequent item sets with an example. (CO3, K3)
3. Explain in detail about frequent item set generation in association rule mining. (CO3, K3)
4. Explain in detail about the evaluation of candidate rules. (CO3, K2)
5. Discuss in detail about measures of significance and interestingness for association rules. (CO3, K3)
6. Discuss in detail the various distance measures used to find the similarities between different entities. (CO3, K3)
7. Explain collaborative filtering based recommendation systems. (CO3, K2)
8. Explain content based recommendation systems. (CO3, K2)
9. Explain in detail about various knowledge based recommendation systems. (CO3, K2)
10. Discuss in detail about hybrid recommendation systems. (CO3, K2)
11. Explain in detail about the vector space model and TF-IDF in content based recommendation systems. (CO3, K2)
12. Discuss in detail about Graph analytics. (CO3, K4)


Supportive Online Courses

Sl. Courses Platform


No.
1 Big Data Computing Swayam

2 Python for Data Science Swayam

3 Applied Machine learning for Python Coursera

4 R Programming Coursera
REAL TIME APPLICATIONS IN DAY TO DAY LIFE

Working of Recommender Systems

Title : How Recommender Systems Work (Netflix/Amazon)


Description : During the last few decades, with the rise of Youtube,
Amazon, Netflix and many other such web services, recommender systems
have taken more and more place in our lives. From e-commerce to online
advertisement, recommender systems are today unavoidable in our daily
online journeys.
In a very general way, recommender systems are algorithms aimed at
suggesting relevant items to users . Recommender systems are really
critical in some industries as they can generate a huge amount of income
when they are efficient or also be a way to stand out significantly from
competitors.
Watch this YouTube video. It explains the techniques behind different
recommendation systems with real-time applications.
https://www.youtube.com/watch?v=n3RKsY2H-NE
Food Recommendations @ Swiggy

Title : Our experiments with food recommendations


@Swiggy - nitin hardeniya
Description : Food is a very personal choice. Swiggy is obsessed with
customer experience and wants to make food discovery on the platform
seamless and a delight for the consumer. So when you open the Swiggy app,
it uses consumers’ implicit/explicit feedback to figure out your taste
preferences, your price affinity, single/group orders, and breakfast/late-night
cravings, and provides a convenient, simple, but highly personalized food
ordering experience.
Watch this YouTube video, in which the food recommendation experience is
shared by Nitin, Senior Data Scientist @Swiggy.
https://www.youtube.com/watch?v=fRXWHJlgmHA
Content Beyond the Syllabus
The two primary drawbacks of the Apriori Algorithm are:-
1. Apriori needs a generation of candidate itemsets. These itemsets may
be large in number if the itemset in the database is huge.
2. Apriori needs multiple scans of the database to check the support of
each itemset generated and this leads to high costs.
These two properties make the algorithm slower. To overcome these
redundant steps, a new association-rule mining algorithm was developed
named Frequent Pattern Growth Algorithm.

Frequent Pattern Growth Algorithm


This algorithm is an improvement to the Apriori method. A frequent
pattern is generated without the need for candidate generation. FP growth
algorithm represents the database in the form of a tree called a frequent
pattern tree or FP tree.
This tree structure will maintain the association between the itemsets. The
database is fragmented using one frequent item. This fragmented part is
called “pattern fragment”. The itemsets of these fragmented patterns are
analyzed. Thus with this method, the search for frequent itemsets is
reduced comparatively.
FP Tree
Frequent Pattern Tree is a tree-like structure that is made with the initial
itemsets of the database. The purpose of the FP tree is to mine the most
frequent pattern. Each node of the FP tree represents an item of the
itemset. The root node represents null while the lower nodes represent the
itemsets. The association of the nodes with the lower nodes is that the
itemsets with the other itemsets are maintained while forming the tree.
Content Beyond the Syllabus
Frequent Pattern Algorithm Steps
The frequent pattern growth method finds frequent patterns without
candidate generation.
1) The first step is to scan the database to find the occurrences of the
itemsets in the database. This step is the same as the first step of Apriori.
The count of 1-itemsets in the database is called support count or
frequency of 1-itemset.
2) The second step is to construct the FP tree. For this, create the root of
the tree. The root is represented by null.
3) The next step is to scan the database again and examine the
transactions. Examine the first transaction and find out the items in it.
The item with the maximum count is taken at the top, the next item with a
lower count below it, and so on. It means that the branch of the tree is
constructed with the transaction’s items in descending order of count.
4) The next transaction in the database is examined. The itemsets are
ordered in descending order of count. If any itemset of this transaction is
already present in another branch (for example in the 1st transaction),
then this transaction branch would share a common prefix to the root.
This means that the common itemset is linked to the new node of another
itemset in this transaction.
5) Also, the count of the itemset is incremented as it occurs in the
transactions. Both the common node and new node count is increased by
1 as they are created and linked according to transactions.
Content Beyond the Syllabus
Frequent Pattern Algorithm Steps

6) The next step is to mine the created FP Tree. For this, the lowest node
is examined first along with the links of the lowest nodes. The lowest node
represents the frequency pattern length 1. From this, traverse the path in
the FP Tree. This path or paths are called a conditional pattern base.
Conditional pattern base is a sub-database consisting of prefix paths in the
FP tree occurring with the lowest node (suffix).
7) Construct a Conditional FP Tree, which is formed by a count of itemsets
in the path. The itemsets meeting the threshold support are considered in
the Conditional FP Tree.
8) Frequent Patterns are generated from the Conditional FP Tree.
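If you want to experiment with FP-Growth in Python, the mlxtend library provides an implementation; the sketch below assumes mlxtend and pandas are installed, and the transactions are toy data.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [["Milk", "Onion", "Egg"],
                ["Onion", "Egg", "Kiwi"],
                ["Milk", "Kiwi", "Egg"],
                ["Milk", "Kiwi"],
                ["Onion", "Kiwi", "Egg"]]

# One-hot encode the transactions, then mine frequent itemsets without candidate generation
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)
print(fpgrowth(onehot, min_support=0.6, use_colnames=True))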
Assessment Schedule

Proposed Date: 24.04.2021
Actual Date: 24.04.2021
Text & Reference Books

Sl. Book Name & Author Book


No.
1 Anand Rajaraman and Jeffrey David Ullman, "Mining of Massive Text Book
Datasets", Cambridge University Press, 2012.

2 David Loshin, "Big Data Analytics: From Strategic Planning to Text Book
Enterprise Integration with Tools, Techniques, NoSQL, and
Graph", Morgan Kaufmann/Elsevier Publishers, 2013.
3 EMC Education Services, "Data Science and Big Data Analytics: Text Book
Discovering, Analyzing, Visualizing and Presenting Data", Wiley
publishers, 2015.
https://bhavanakhivsara.files.wordpress.com/2018/06/data-
science-and-big-data-analy-nieizv_book.pdf
4 Bart Baesens, "Analytics in a Big Data World: The Essential Text Book
Guide to Data Science and its Applications", Wiley Publishers,
2015.
5 Dietmar Jannach and Markus Zanker, "Recommender Systems: Text Book
An Introduction", Cambridge University Press, 2010.
https://drive.google.com/file/d/1Wr4fllOj03X72rL8CHgVJ1dGxG
58N63S/view?usp=sharing
6 Kim H. Pries and Robert Dunnigan, "Big Data Analytics: A Reference
Practical Guide for Managers " CRC Press, 2015. Book
7 Jimmy Lin and Chris Dyer, "Data-Intensive Text Processing with Reference
MapReduce", Synthesis Lectures on Human Language Book
Technologies, Vol. 3, No. 1, Pages 1-177, Morgan Claypool
publishers, 2010.
Mini Project Suggestions

Implementation of Movie Recommender System

Recommender System is a system that seeks to predict or filter preferences according to


the user’s choices. Recommender systems are utilized in a variety of areas including
movies, music, news, books, research articles, search queries, social tags, and products
in general.
Build a basic recommendation system by suggesting items that are most similar to a
particular item, in this case, movies. It simply tells which movies/items are most
similar to the user’s movie choice.

To download the files, click on the links – .tsv file, Movie_Id_Titles.csv

Link to refer

https://www.geeksforgeeks.org/python-implementation-of-movie-recommender-system/
Virtusa Codelite Contest Project Discussion

Loan Prediction
https://drive.google.com/file/d/1r5Y1mC_7-3PjoiuagS-U15A02Ijgu_db/view?usp=sharing

Placement Prediction
https://drive.google.com/file/d/1VpOERm2CAZNTH7fBERU6Lv3nqRyVelPi/view?usp=shari
ng
Thank you

Disclaimer:

This document is confidential and intended solely for the educational purpose of RMK Group of
Educational Institutions. If you have received this document through email in error, please notify the
system manager. This document contains proprietary information and is intended only to the
respective group / learning community as intended. If you are not the addressee you should not
disseminate, distribute or copy through e-mail. Please notify the sender immediately by e-mail if you
have received this document by mistake and delete this document from your system. If you are not
the intended recipient you are notified that disclosing, copying, distributing or taking any action in
reliance on the contents of this information is strictly prohibited.
