CMPT 741 Fall 2009

Data Mining
Martin Ester

Solution Assignment 5

Total marks: 100 (20 % of the assignments)


Due date: December 1, 2009

Assignment 5.1 (60 marks)


Mining association rules can be expensive. In a large, dynamic database of transactions, we want to
store the frequent itemsets and incrementally update them upon arrival of a set of new transactions.
Let $DB$ denote the last state of our database and $\Delta DB$ a set of new transactions. We assume that we
use the Apriori algorithm and have stored all frequent itemsets in $DB$ with respect to min-supp
(relative frequency threshold). The task is to incrementally determine the frequent itemsets
in $DB \cup \Delta DB$ with respect to min-supp, without re-applying the Apriori algorithm to the whole
updated database $DB \cup \Delta DB$.

a) For which itemsets do you need to count the support in $DB$? For which itemsets do you need to
count the support in $\Delta DB$? Provide proofs for your answers. (30 marks)

We need to count the support in $DB$ of all itemsets that are frequent in $\Delta DB$ but not in $DB$,
since these itemsets may become frequent in $DB \cup \Delta DB$, but their support in $DB$ was not
stored (they were infrequent).

We need to count the support in $\Delta DB$ of all itemsets that are frequent in $DB$ but not in
$\Delta DB$, since these itemsets may become frequent in $DB \cup \Delta DB$, but their support in
$\Delta DB$ was not returned by the Apriori algorithm (they were infrequent).

If an itemset is frequent neither in $DB$ nor in $\Delta DB$, then it cannot be frequent in
$DB \cup \Delta DB$, according to the following proof:

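Let $\sigma$ denote the (relative) minimum support threshold, and let $A$ denote some itemset.

If $A$ is not frequent in $DB$, then
\[
\frac{|\{t \in DB \mid A \subseteq t\}|}{|DB|} < \sigma, \quad \text{i.e.} \quad |\{t \in DB \mid A \subseteq t\}| < \sigma \cdot |DB|. \tag{1}
\]

If $A$ is not frequent in $\Delta DB$, then
\[
\frac{|\{t \in \Delta DB \mid A \subseteq t\}|}{|\Delta DB|} < \sigma, \quad \text{i.e.} \quad |\{t \in \Delta DB \mid A \subseteq t\}| < \sigma \cdot |\Delta DB|. \tag{2}
\]

Using (1), (2), and the disjointness of $DB$ and $\Delta DB$ we obtain
\[
|\{t \in DB \cup \Delta DB \mid A \subseteq t\}| < \sigma \cdot |DB| + \sigma \cdot |\Delta DB| = \sigma \cdot (|DB| + |\Delta DB|). \tag{3}
\]
Since $DB$ and $\Delta DB$ are disjoint,
\[
|DB \cup \Delta DB| = |DB| + |\Delta DB|. \tag{4}
\]

Using (3) and (4), we conclude
\[
\frac{|\{t \in DB \cup \Delta DB \mid A \subseteq t\}|}{|DB \cup \Delta DB|} < \frac{\sigma \cdot (|DB| + |\Delta DB|)}{|DB \cup \Delta DB|} = \frac{\sigma \cdot (|DB| + |\Delta DB|)}{|DB| + |\Delta DB|} = \sigma,
\]
i.e. $A$ is infrequent in $DB \cup \Delta DB$.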
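
As a quick sanity check of the argument, here is a small worked instance with made-up numbers (they are illustrative assumptions, not part of the assignment): a relative threshold of 0.5, an old database of 4 transactions, an increment of 3 transactions, and an itemset that occurs in exactly one transaction of each part.

```latex
\[
\underbrace{1 < 0.5 \cdot 4 = 2}_{\text{infrequent in the old database}}, \qquad
\underbrace{1 < 0.5 \cdot 3 = 1.5}_{\text{infrequent in the increment}}, \qquad
\frac{1 + 1}{4 + 3} = \frac{2}{7} \approx 0.29 < 0.5 .
\]
```

The combined relative support stays below the threshold, as the proof predicts.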

b) Describe the entire method for incrementally updating the set of frequent itemsets. (30 marks)

Let support($i$, $D$) denote the number of supporting transactions for itemset $i$ in the set of
transactions $D$. We assume that the support in $DB$ for all elements of $F_{DB}$ has been stored.

The following method IncrementalApriori returns the set of all frequent itemsets in
$DB \cup \Delta DB$:

IncrementalApriori($DB$, $F_{DB}$, $\Delta DB$, min-supp)   // $F_{DB}$ denotes the frequent itemsets in $DB$

    $F_{\Delta DB}$ = Apriori($\Delta DB$, min-supp);

    Result = $F_{DB} \cap F_{\Delta DB}$;   // itemsets that are frequent in $DB$ and in $\Delta DB$

    Scan $DB$ to calculate the support in $DB$ of all elements in $F_{\Delta DB} \setminus F_{DB}$;

    for each $i$ in $F_{\Delta DB} \setminus F_{DB}$ do
        if $\frac{\text{support}(i, DB) + \text{support}(i, \Delta DB)}{|DB \cup \Delta DB|} \geq$ min-supp
        then Result = Result $\cup \{i\}$;

    Scan $\Delta DB$ to calculate the support in $\Delta DB$ of all elements in $F_{DB} \setminus F_{\Delta DB}$;

    for each $i$ in $F_{DB} \setminus F_{\Delta DB}$ do
        if $\frac{\text{support}(i, DB) + \text{support}(i, \Delta DB)}{|DB \cup \Delta DB|} \geq$ min-supp
        then Result = Result $\cup \{i\}$;

    return Result
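
The method above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: transactions are modelled as Python sets, the stored frequent itemsets are assumed to be a dict mapping each `frozenset` to its absolute support count, and the names `support`, `apriori`, and `incremental_apriori` are chosen here for illustration. The small `apriori` helper is a plain level-wise version included only to keep the example self-contained.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def apriori(transactions, min_supp):
    """Return {itemset: absolute support} for all itemsets whose
    relative support in `transactions` is at least `min_supp`."""
    n = len(transactions)
    level = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while level:
        counts = {c: support(c, transactions) for c in level}
        current = {c: s for c, s in counts.items() if s >= min_supp * n}
        frequent.update(current)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-itemset candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        level = {c for c in candidates
                 if all(frozenset(s) in current for s in combinations(c, k - 1))}
    return frequent


def incremental_apriori(db, f_db, delta_db, min_supp):
    """Update the frequent itemsets after `delta_db` is appended to `db`.

    `f_db` is assumed to map each frequent itemset of `db` to its stored
    support count in `db`.
    """
    f_delta = apriori(delta_db, min_supp)
    total = len(db) + len(delta_db)

    # Frequent in both parts: frequent in the union, supports already known.
    result = {i: f_db[i] + f_delta[i] for i in f_db.keys() & f_delta.keys()}

    # Frequent only in the increment: one scan of db for the missing counts.
    for i in f_delta.keys() - f_db.keys():
        s = support(i, db) + f_delta[i]
        if s >= min_supp * total:
            result[i] = s

    # Frequent only in the old database: one scan of the increment.
    for i in f_db.keys() - f_delta.keys():
        s = f_db[i] + support(i, delta_db)
        if s >= min_supp * total:
            result[i] = s

    return result
```

For instance, with the old transactions `[{"a","b"}, {"a","c"}, {"a","b","c"}, {"b"}]`, the new transactions `[{"c"}, {"c","d"}, {"a","c"}]`, and a threshold of 0.5, the sketch returns the itemsets {a} (support 4) and {c} (support 5), which matches running `apriori` on the concatenated list.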

Assignment 5.2 (40 marks)


Consider the task of classifying HTML documents. Compared to simple text documents, HTML
documents have not only textual content but also a certain structure, represented by the HTML
tags. In addition, HTML documents may contain images and further multimedia data types.
The objective of this assignment is to discuss ways to incorporate structure information and
multimedia data to improve the classification accuracy of HTML documents.
(a) What HTML tags represent structural information useful for classification? How can this
information be included in the document representation? What are the limitations of this
approach? Describe your overall approach. (20 marks)

The following are some of the HTML tags that represent useful structural information:

<title>         Defines the document title
<h1> to <h6>    Defines header 1 to header 6
<b>             Defines bold text

It can be assumed that, in general, terms occurring within these tags are especially important for
document classification. To exploit this information, term occurrences within these tags could
receive higher weights, e.g. in <title>: weight * 10, in <h1>: weight * 5, in <h2>: weight * 4,
and so on.
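
The tag-based weighting can be sketched with Python's standard `html.parser` module. This is only an illustration under assumptions: the weights for `<title>`, `<h1>`, and `<h2>` follow the example values above, while the remaining weights, the tokenisation by a simple regular expression, the choice of the largest weight for nested tags, and the names `TAG_WEIGHTS`, `WeightedTermCounter`, and `weighted_term_frequencies` are hypothetical choices made here.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# title, h1, h2 follow the example values; all other weights are assumptions.
TAG_WEIGHTS = {"title": 10, "h1": 5, "h2": 4, "h3": 3,
               "h4": 2, "h5": 2, "h6": 2, "b": 2}


class WeightedTermCounter(HTMLParser):
    """Accumulates term frequencies, scaled by the enclosing structural tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []       # currently open tags that carry a weight
        self.weights = Counter()  # term -> weighted frequency

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recently opened matching tag, tolerating sloppy nesting.
        for idx in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[idx] == tag:
                del self.open_tags[idx]
                break

    def handle_data(self, data):
        # Assumption: with nested weighted tags, the largest weight applies.
        factor = max((TAG_WEIGHTS[t] for t in self.open_tags), default=1)
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.weights[term] += factor


def weighted_term_frequencies(html):
    """Return {term: weighted frequency} for one HTML document."""
    parser = WeightedTermCounter()
    parser.feed(html)
    parser.close()
    return dict(parser.weights)
```

The resulting dictionary can replace the plain term-frequency vector of a document before it is passed to a text classifier.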

While this approach makes some sense, it is rather limited: HTML tags specify the layout, not the
semantics, and are therefore not very reliable for the purpose of term weighting.

(b) What features would be useful to represent an image contained in an HTML document? We
make the simplifying assumption that an HTML document contains no more than one image.
How can you exploit the image features (besides the textual features) in the classification
algorithm? What are the limitations of this approach? Answer these questions for a specific
classification algorithm of your choice. Describe your overall approach. (20 marks)

Assuming that we do not perform object identification / image segmentation (which can be very
difficult), the most useful image features would be colour histograms. Thus, an image could be
represented by a colour feature vector with one dimension per relevant colour, recording the
number of pixels with that colour in the corresponding image.
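
A colour histogram of this kind can be sketched in a few lines of Python. The sketch assumes that an image is already available as a sequence of `(r, g, b)` tuples with channel values from 0 to 255, and it interprets "relevant colour" as a cell of a uniform quantisation of RGB space; both the quantisation and the name `colour_histogram` are assumptions made for illustration.

```python
def colour_histogram(pixels, bins_per_channel=4):
    """Count pixels per quantised RGB colour.

    `pixels` is an iterable of (r, g, b) tuples with values in 0..255.
    The result has bins_per_channel ** 3 dimensions, one per colour cell.
    """
    hist = [0] * bins_per_channel ** 3
    for r, g, b in pixels:
        # Map each channel value to a bin index in 0..bins_per_channel-1.
        r_bin = r * bins_per_channel // 256
        g_bin = g * bins_per_channel // 256
        b_bin = b * bins_per_channel // 256
        hist[(r_bin * bins_per_channel + g_bin) * bins_per_channel + b_bin] += 1
    return hist
```

With `bins_per_channel=2`, for example, the vector has 8 dimensions, and two nearly identical reds such as `(255, 0, 0)` and `(250, 10, 5)` fall into the same dimension.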

A simple approach to combining text and image content would be to concatenate the term
frequency vectors and the colour frequency vectors of a document. The main advantage of this
approach is that off-the-shelf classification methods are directly applicable to the concatenated
feature vectors. However, text and colour feature vectors may have very different
dimensionalities and, as a consequence, very different implicit weights in the classifier, e.g. in
an SVM. The implicit weights given by the ratio of the text and image vector dimensionalities
may not be appropriate from an application point of view.
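
The concatenation itself is a one-liner; the sketch below additionally shows one possible way to make the relative weight of the two parts explicit. Normalising each part to unit length and scaling the image part by a factor `image_weight` is an assumption introduced here as an illustration, not something prescribed by the solution, and the function names are hypothetical.

```python
import math


def l2_normalise(vec):
    """Scale a vector to unit Euclidean length (zero vectors stay unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def combine_features(text_vec, image_vec, image_weight=1.0):
    """Concatenate text and image features into one vector.

    Each part is normalised separately so that its influence no longer
    depends on its dimensionality; `image_weight` then sets the relative
    importance of the image part explicitly.
    """
    text_part = l2_normalise(text_vec)
    image_part = [image_weight * x for x in l2_normalise(image_vec)]
    return text_part + image_part
```

The combined vectors can then be passed to any off-the-shelf classifier; a suitable value for `image_weight` would have to be chosen per application, for example by cross-validation.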
