
Why use sampling? To check the frequency of an itemset.

Mining from a sample is much faster.


The size of the sample impacts the probability of errors.

An experiment with two possible outcomes is called a Bernoulli trial.


Let X be a random variable that denotes the number of successes in n trials, each with success probability p.
In n experiments we expect pn successes. The chance of ending up above or below this
pn is formalized in a formula on the slides.
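The formula itself is on the slides; a standard form of this bound (Hoeffding's inequality, which matches the e^{-2\epsilon^2 n} expression used further down in these notes) is:

P(|X/n - p| >= \epsilon) <= 2 e^{-2\epsilon^2 n}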

In sampling:
p: the support of Z in the database
n: the sample size
m: the number of sampled transactions that contain all items of the set Z (the set we check)

We estimate "probably approximately correct": probably (with a confidence parameter, written \delta in the usual PAC notation; see slides) and approximately (within \epsilon).
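A minimal Python sketch of the estimate from the sample (the variable and function names here are my own, not from the slides):

import random

def estimate_support(transactions, Z, n, seed=0):
    # p-hat = m / n: fraction of sampled transactions containing all of Z.
    random.seed(seed)
    sample = random.sample(transactions, n)        # sample of size n
    Z = set(Z)
    m = sum(1 for t in sample if Z <= set(t))      # transactions containing Z
    return m / n

# toy usage
db = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}, {"a", "b"}]
print(estimate_support(db, {"a", "b"}, n=4))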


If you want to be more precise and more certain, the required sample size n increases.
A table of these sample sizes is shown in the slides.
If our sample size is at least this n, that is enough to guarantee the estimate is probably approximately correct.

(The big \vee symbol means OR.)

To go from a guarantee for one itemset to a guarantee for all itemsets, we need the union bound.
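From the bound above, requiring 2 e^{-2\epsilon^2 n} <= \delta gives n >= ln(2/\delta) / (2\epsilon^2); the union bound then replaces \delta by \delta / k when the guarantee must hold for k itemsets simultaneously. A small sketch (the table on the slides may use a slightly different constant):

import math

def sample_size(eps, delta):
    # Smallest n with 2 * exp(-2 * eps^2 * n) <= delta (Hoeffding, two-sided).
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

def sample_size_all(eps, delta, num_itemsets):
    # Union bound: to hold for all itemsets at once, split delta over them.
    return sample_size(eps, delta / num_itemsets)

print(sample_size(0.01, 0.05))              # a single itemset
print(sample_size_all(0.01, 0.05, 10**6))   # simultaneously for 10^6 itemsets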

There are two types of errors.


1. Itemsets frequent in the sample but not in the database: easily counteracted
- Solve this by making one extra scan over the database

2. Itemsets that are not frequent in the sample but frequent in the database


- We declare more itemsets frequent in the sample, and then discard all non-frequent ones in the
database
- But! we don't want to set the sample threshold to 1 (the minimum) just to get all combinations,
because then we have far too many sets

For this we lower the threshold, so we have a lower chance of missing itemsets in
the sample.
How much we lower it is denoted by \mu.
The probability of being \epsilon or more away is at most e^{-2\epsilon^2 n}.

The bad case is when p is bigger than \hat{p} (the estimate from the sample).
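In symbols (a sketch of the reasoning; the precise statement is in the slides): if Z has true support p >= minsup and we mine the sample at the lowered threshold minsup - \mu, then

P(\hat{p} <= minsup - \mu) <= P(\hat{p} <= p - \mu) <= e^{-2\mu^2 n}

so the chance of missing a truly frequent itemset shrinks exponentially in the sample size n.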

We can still miss frequent itemsets, even with the lowered threshold.
Can we recover these and add them to the frequent itemsets?
A theorem in the slides is devoted to this.

The algo is:


Get the frequent itemsets from the sample.
Perform a scan over the database.
If there are itemsets in the border that are frequent,
add them to the lattices we need to check,
then scan again for those lattices.
(Line 7 of the pseudocode generates new candidates.)
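A rough Python sketch of these steps (the toy miner, the border computation, and all names are my own; they only stand in for the subroutines on the slides):

from itertools import combinations

def count_supports(db, itemsets):
    # For each candidate itemset, count the transactions that contain it.
    counts = {Z: 0 for Z in itemsets}
    for t in db:
        t = set(t)
        for Z in itemsets:
            if Z <= t:
                counts[Z] += 1
    return counts

def mine_frequent(transactions, minsup, max_size=3):
    # Toy miner: enumerate all itemsets up to max_size (fine for small examples).
    items = sorted({i for t in transactions for i in t})
    frequent = set()
    for k in range(1, max_size + 1):
        cands = [frozenset(c) for c in combinations(items, k)]
        counts = count_supports(transactions, cands)
        frequent |= {Z for Z, c in counts.items()
                     if c >= minsup * len(transactions)}
    return frequent

def negative_border(frequent, items):
    # Itemsets that are not frequent themselves but all of whose proper
    # subsets are frequent (the empty set counts as frequent).
    freq = set(frequent) | {frozenset()}
    border = set()
    for Z in freq:
        for i in set(items) - set(Z):
            cand = frozenset(Z) | {i}
            if cand not in freq and all(cand - {j} in freq for j in cand):
                border.add(cand)
    return border

def sample_then_verify(db, sample, minsup, mu):
    items = {i for t in db for i in t}
    min_count = minsup * len(db)

    # 1. Mine the sample with the lowered threshold (minsup - mu).
    sample_frequent = mine_frequent(sample, minsup - mu)

    # 2./3. Scan the database; while itemsets in the border turn out frequent,
    #       extend the lattice we check and scan again (new candidates).
    to_check = sample_frequent | negative_border(sample_frequent, items)
    checked = set()
    frequent = set()
    while to_check:
        checked |= to_check
        counts = count_supports(db, to_check)
        frequent |= {Z for Z, c in counts.items() if c >= min_count}
        to_check = negative_border(frequent, items) - checked
    return frequent

# toy usage: sample every other transaction, lower the threshold by mu = 0.1
db = [{"a","b","c"}, {"a","b"}, {"a","c"}, {"b","c"}, {"a","b","c"}, {"b"}]
print(sample_then_verify(db, db[::2], minsup=0.5, mu=0.1))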

Observation.
In computing the sample size,
p was the probability that a random transaction t supports the itemset Z.
That is, the indicator 1[Z \subseteq t].

This is already an example of a classifier.


In machine learning this is called "probably approximately correct (PAC) learning".

Now we are working with classification.


We are only interested in the conditional distribution => P(Y=y | X=x),
where X is what we already know, and Y is not known or harder to observe.
// We want to predict Y from X.
We will predict which Y is most likely given X.
If we only want to know the most likely one, we do not care about the full distribution.
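A minimal sketch of "predict the most likely Y given X" (the toy conditional distribution here is invented for illustration):

# Toy conditional distribution P(Y = y | X = x).
cond = {
    "sunny": {"play": 0.8, "stay_home": 0.2},
    "rainy": {"play": 0.3, "stay_home": 0.7},
}

def predict(x):
    # We only need the argmax over y, not the full distribution.
    p_y_given_x = cond[x]
    return max(p_y_given_x, key=p_y_given_x.get)

print(predict("rainy"))   # stay_home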

We also want to know how good our classifier is.


Intuitively this is easy, since we assume there is a true function:
the more often our classifier agrees with the true value, the better it is.

This, however, fails mathematically.


The domain we are dealing with is infinite.
Each wrong prediction is equally bad.

To solve this we go to probabilities


We want to minimize the risk of classifying wrongly.
This is not easy, since we might be overfitting on the sample.
The risk on the sample is then minimized, for example by assigning 1 to everything, but the true loss is much
bigger.
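A small sketch of the gap between risk on the sample (empirical risk) and true risk; the numbers and the constant classifier are invented for illustration:

# 0/1 loss: a penalty of 1 for every wrong prediction.
def empirical_risk(classifier, sample):
    return sum(classifier(x) != y for x, y in sample) / len(sample)

# A sample that happens to contain mostly label 1.
sample = [(0.2, 1), (0.5, 1), (0.9, 1), (0.7, 0)]

always_one = lambda x: 1
print(empirical_risk(always_one, sample))   # 0.25: looks good on the sample

# If the true distribution actually has P(Y = 1) = 0.5, the true risk of
# always predicting 1 is 0.5: much bigger than the sample suggests.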

Hypothesis set. Example: ask a person whether another person is tall.


The hypothesis is a threshold value \theta.
Every time a wrong prediction is made, it incurs a penalty of 1.
The goal is to minimize the penalty.
When the adversary knows the threshold value \theta that your classifier uses, the
penalty is infinite.
There is no way to learn the true \theta value exactly.
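A sketch of the threshold hypothesis in this example (my own formulation of the "is this person tall" classifier):

# Hypothesis class: thresholds theta; predict "tall" (1) iff height >= theta.
def h(theta, height):
    return 1 if height >= theta else 0

# 0/1 penalty for a single prediction against the true label.
def penalty(theta, height, true_label):
    return int(h(theta, height) != true_label)

# An adversary who knows theta can always choose a height just on the wrong
# side of it, so the total penalty over a long sequence grows without bound.
theta = 1.80
print(penalty(theta, 1.79, true_label=1))   # adversarial point: penalty 1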

We first look at finite cases, where there are not infinitely many decimals, so
integers.
We will
