Data Mining
Martin Ester
Solution Assignment 5
a) For which itemsets do you need to count the support in DB? For which itemsets do you need to
count the support in ΔDB? Provide proofs for your answers. (30 marks)
We need to count the support in DB of all itemsets that are frequent in ΔDB but not in DB,
since these itemsets may become frequent in DB ∪ ΔDB, but their support in DB was not
stored (they were infrequent).
We need to count the support in ΔDB of all itemsets that are frequent in DB but not in
ΔDB, since these itemsets may be frequent in DB ∪ ΔDB, but their support in
ΔDB was not returned by the Apriori algorithm (they were infrequent).
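Itemsets that are frequent in neither DB nor ΔDB need not be counted, because they cannot be
frequent in DB ∪ ΔDB. Proof:
Let σ denote the (relative) minimum support threshold, and let A denote some itemset.
If A is not frequent in DB, then
$$\frac{|\{t \in DB \mid A \subseteq t\}|}{|DB|} < \sigma, \quad \text{i.e.} \quad |\{t \in DB \mid A \subseteq t\}| < \sigma \cdot |DB| \qquad (1)$$
If A is not frequent in ΔDB, then analogously
$$|\{t \in \Delta DB \mid A \subseteq t\}| < \sigma \cdot |\Delta DB| \qquad (2)$$
Adding (1) and (2) yields
$$|\{t \in DB \cup \Delta DB \mid A \subseteq t\}| < \sigma \cdot |DB| + \sigma \cdot |\Delta DB| = \sigma \cdot (|DB| + |\Delta DB|) \qquad (3)$$
Since DB and ΔDB are disjoint,
$$|DB \cup \Delta DB| = |DB| + |\Delta DB| \qquad (4)$$
From (3) and (4),
$$\frac{|\{t \in DB \cup \Delta DB \mid A \subseteq t\}|}{|DB \cup \Delta DB|} < \frac{\sigma \cdot (|DB| + |\Delta DB|)}{|DB \cup \Delta DB|} = \frac{\sigma \cdot (|DB| + |\Delta DB|)}{|DB| + |\Delta DB|} = \sigma,$$
i.e. A is infrequent in DB ∪ ΔDB.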
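The statement proved above can also be checked by brute force on a small example. The following is a minimal sketch, not part of the original solution: the toy transactions, the threshold value and the helper names `support` and `is_frequent` are assumptions made for illustration. It enumerates every itemset and looks for one that is frequent in the union although it is infrequent in both parts.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def is_frequent(itemset, transactions, sigma):
    """True if the relative support of `itemset` is at least `sigma`."""
    return support(itemset, transactions) >= sigma * len(transactions)


# Hypothetical toy data: DB and the increment are disjoint lists of transactions.
db = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("bd")]
delta_db = [frozenset("ad"), frozenset("cd")]
sigma = 0.5

# Enumerate every non-empty itemset over the items that occur anywhere.
items = sorted(set().union(*db, *delta_db))
candidates = [frozenset(c)
              for k in range(1, len(items) + 1)
              for c in combinations(items, k)]

# Itemsets frequent in the union but infrequent in both parts (should not exist).
violations = [a for a in candidates
              if is_frequent(a, db + delta_db, sigma)
              and not is_frequent(a, db, sigma)
              and not is_frequent(a, delta_db, sigma)]
```

An empty `violations` list is what the proof predicts for any choice of data and threshold; the check is only an illustration on one data set, not a substitute for the proof.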
b) Describe the entire method for incrementally updating the set of frequent itemsets. (30 marks)
Let support(i, D) denote the number of supporting transactions for itemset i in the set of
transactions D. We assume that the support in DB for all elements of F_DB has been stored.
The following method IncrementalApriori returns the set of all frequent itemsets in
DB ∪ ΔDB:
F_ΔDB = Apriori(ΔDB, min-supp);
return Result
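One possible reading of the method, based on the answer to part a), is sketched below in Python. It is an illustration under stated assumptions rather than the original pseudocode: the function names `count`, `apriori` and `incremental_apriori`, the dictionary representation of the stored supports, and the toy data are all hypothetical. Itemsets frequent in only one of the two parts are counted in the other part; itemsets frequent in neither are never considered.

```python
from itertools import combinations


def count(itemset, transactions):
    """Absolute support: number of transactions containing `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def apriori(transactions, min_supp):
    """Plain level-wise Apriori; returns {itemset: absolute support}."""
    threshold = min_supp * len(transactions)
    level = [frozenset([i]) for i in {i for t in transactions for i in t}]
    frequent, k = {}, 1
    while level:
        counts = {c: count(c, transactions) for c in level}
        current = {c: n for c, n in counts.items() if n >= threshold}
        frequent.update(current)
        k += 1
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        joined = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        level = [c for c in joined
                 if all(frozenset(s) in current for s in combinations(c, k - 1))]
    return frequent


def incremental_apriori(db, f_db, delta_db, min_supp):
    """f_db holds the stored supports in DB of all itemsets frequent in DB."""
    f_delta = apriori(delta_db, min_supp)
    threshold = min_supp * (len(db) + len(delta_db))
    result = {}
    for itemset in set(f_db) | set(f_delta):
        # Reuse a known support; scan the other part only when it is unknown.
        supp_db = f_db[itemset] if itemset in f_db else count(itemset, db)
        supp_delta = (f_delta[itemset] if itemset in f_delta
                      else count(itemset, delta_db))
        if supp_db + supp_delta >= threshold:
            result[itemset] = supp_db + supp_delta
    return result


# Hypothetical toy data for a usage example.
db = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("bd")]
delta_db = [frozenset("ad"), frozenset("cd")]
updated = incremental_apriori(db, apriori(db, 0.5), delta_db, 0.5)
from_scratch = apriori(db + delta_db, 0.5)
```

On this data the incremental result matches a full Apriori run over the combined transactions, while DB is scanned only for the itemsets that are frequent in the increment but were not stored.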
Some HTML tags, such as <title>, <h1> and <h2>, represent useful structural information.
It can be assumed that, in general, terms occurring in these tags are especially important for
document classification. To exploit this information, term occurrences within these tags could
receive higher weights, e.g. in <title>: weight * 10, in <h1>: weight * 5, in <h2>: weight * 4, etc.
While this approach makes some sense, it is rather limited: HTML tags specify the
layout, not the semantics, and are therefore not very reliable for the purpose of term weighting.
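The tag-based weighting described above can be sketched with Python's standard `html.parser` module. This is a minimal illustration, not a prescribed implementation: the class name `WeightedTermCounter`, the tokenisation by a simple regular expression, and the default weight of 1 for all other tags are assumptions; only the weights for <title>, <h1> and <h2> come from the text.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Weights from the text; any other tag is assumed to have weight 1.
TAG_WEIGHTS = {"title": 10, "h1": 5, "h2": 4}


class WeightedTermCounter(HTMLParser):
    """Accumulates term frequencies, weighted by the enclosing tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []       # stack of currently open tags
        self.weights = Counter()  # term -> weighted frequency

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop up to and including the matching tag; ignore stray end tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # A term gets the highest weight among all tags that enclose it.
        weight = max((TAG_WEIGHTS.get(t, 1) for t in self.open_tags), default=1)
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.weights[term] += weight


def weighted_term_frequencies(html):
    parser = WeightedTermCounter()
    parser.feed(html)
    parser.close()
    return parser.weights


# Hypothetical example document.
doc = ("<html><head><title>Data Mining</title></head><body>"
       "<h1>Mining itemsets</h1><p>Frequent itemsets in data.</p>"
       "</body></html>")
tf = weighted_term_frequencies(doc)
```

In this example "mining" occurs once in the title and once in the <h1> heading, so it receives a weighted frequency of 10 + 5 = 15, whereas "frequent" occurs only in body text and receives 1.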
(b) What features would be useful to represent an image contained in an HTML document? We
make the simplifying assumption that an HTML document contains no more than one image.
How can you exploit the image features (besides the textual features) in the classification
algorithm? What are the limitations of this approach? Answer these questions for a specific
classification algorithm of your choice. Describe your overall approach. (20 marks)
Assuming that we do not perform object identification / image segmentation (which can be very
difficult), the most useful image features would be colour histograms. Thus, an image could be
represented by a colour feature vector with one dimension per relevant colour, recording the
number of pixels with that colour in the corresponding image.
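A colour feature vector of this kind could be computed as sketched below. The sketch makes two assumptions that the text does not specify: pixels are given as (r, g, b) tuples with values from 0 to 255, and the "relevant colours" are obtained by quantising each channel into a fixed number of bins. The function name `colour_histogram` and the parameter `bins_per_channel` are hypothetical.

```python
def colour_histogram(pixels, bins_per_channel=4):
    """Count pixels per quantised colour.

    Returns a list with bins_per_channel ** 3 entries, one per colour bin.
    """
    n = bins_per_channel
    hist = [0] * (n ** 3)
    for r, g, b in pixels:
        # Map each 0..255 channel value to a bin index 0..n-1.
        rb, gb, bb = r * n // 256, g * n // 256, b * n // 256
        hist[(rb * n + gb) * n + bb] += 1
    return hist


# Hypothetical 4-pixel image: two reddish pixels, one blue, one black.
pixels = [(255, 0, 0), (250, 10, 5), (0, 0, 255), (0, 0, 0)]
hist = colour_histogram(pixels, bins_per_channel=2)
```

With two bins per channel the vector has 8 dimensions; the two reddish pixels fall into the same bin, which is the intended effect of quantisation.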
A simple approach to combine text and image content would be to concatenate the term
frequency vectors and the colour frequency vectors of a document. The main advantage of this
approach is that off-the-shelf classification methods are directly applicable to the concatenated
feature vectors. However, text and colour feature vectors may have largely different
dimensionalities and corresponding weights in the classifier, e.g. in an SVM. The implicit
weights given by the ratio of the text and image vector dimensionalities may not be appropriate
from an application point of view.
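The concatenation approach, together with one conceivable way of making the relative weights explicit, can be sketched as follows. The normalisation and the two weight parameters are assumptions added for illustration, not something the approach above prescribes: each vector is scaled to unit Euclidean length so that its dimensionality no longer determines its influence, and is then multiplied by a weight chosen for the application. The function names are hypothetical.

```python
import math


def l2_normalise(vec):
    """Scale a vector to unit Euclidean length (zero vectors are unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def combine(text_vec, colour_vec, text_weight=1.0, colour_weight=1.0):
    """Concatenate the two feature vectors after normalising and weighting."""
    text_part = [text_weight * x for x in l2_normalise(text_vec)]
    colour_part = [colour_weight * x for x in l2_normalise(colour_vec)]
    return text_part + colour_part


# Hypothetical term-frequency and colour-frequency vectors.
features = combine([3, 4], [0, 0, 5], text_weight=1.0, colour_weight=0.5)
```

The resulting vector can be passed to any off-the-shelf classifier; suitable values for the two weights would have to be chosen for the application, for example by validation.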