COURSE OUTCOMES
CO1 Interpret big data analytics frameworks such as the Hadoop ecosystem and Spark architecture, and apply them to
specific case studies
CO2 Use real-time analytical methods on streaming datasets to react quickly to customer needs
CO3 Analyze and develop transferable skills needed to create and architect big data systems
CO4 Describe a wide range of big data tools and techniques
CO5 Analyze big data problems by identifying key requirements, alternative solutions and evaluation methods
2. List the rules that must be followed when representing a stream by buckets. U CO2 2
6 rules – 2 marks
6. Compute the surprise number (second moment) for the stream 3, 1, 4, 1, 3,
4, 2, 1, 2. What is the third moment of this stream? AP CO2 2
Table: (1 Mark)
Element   Occurrences   1st moment   2nd moment   3rd moment
1         3             3            9            27
2         2             2            4            8
3         2             2            4            8
4         2             2            4            8
Sum                     9            21           51
First moment (length of the stream): 9
Second moment (surprise number): 21
Third moment: 51 (1 Mark)
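The table above can be checked with a few lines of Python. This is a minimal sketch; the helper name `kth_moment` is our own and is not part of the question paper. It counts the occurrences of each element and sums their k-th powers.

```python
from collections import Counter

def kth_moment(stream, k):
    """Sum of (occurrence count)^k over the distinct elements of the stream."""
    return sum(count ** k for count in Counter(stream).values())

STREAM = [3, 1, 4, 1, 3, 4, 2, 1, 2]

# First, second (surprise number) and third moments of the stream.
moments = [kth_moment(STREAM, k) for k in (1, 2, 3)]
print(moments)  # [9, 21, 51]
```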
7. Let X and Y be two itemsets, and let supp(X) denote the support of itemset X.
The confidence of the rule X->Y is denoted by conf(X->Y). Write the
formula for conf(X->Y). A CO3 2
conf(X->Y) = supp(X∪Y)/supp(X)
8. How does the Apriori algorithm work to mine frequent itemsets and learn
association rules over databases? U CO4 2
Scan the transaction database to get the support S of each 1-itemset,
compare S with min_sup, and keep the set of frequent 1-itemsets. Use the join
step to generate a set of candidate k-itemsets from the frequent (k-1)-itemsets.
Use the Apriori property to prune the infrequent k-itemsets from this set.
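The scan, join, and prune steps described above can be sketched in Python as follows. This is an illustrative sketch, not a reference implementation: the function names `apriori` and `confidence` and the toy transactions are hypothetical, and min_sup is given as an absolute count.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: support count} for every itemset with count >= min_sup."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every single item that occurs in the database.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent, k = {}, 1
    while current:
        # Scan: count the support of each candidate k-itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(level)
        k += 1
        # Join: unite pairs of frequent (k-1)-itemsets that form a k-itemset.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune (Apriori property): every (k-1)-subset must itself be frequent.
        current = {c for c in candidates
                   if all(frozenset(sub) in level for sub in combinations(c, k - 1))}
    return frequent

def confidence(frequent, x, y):
    """conf(X -> Y) = supp(X u Y) / supp(X), using counts from apriori()."""
    return frequent[frozenset(x) | frozenset(y)] / frequent[frozenset(x)]

# Hypothetical toy database (not taken from the question paper).
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]
frequent = apriori(TRANSACTIONS, min_sup=3)
print(confidence(frequent, {"beer"}, {"diaper"}))  # 1.0
```

With min_sup = 3 the sketch keeps four frequent items and four frequent pairs; the only candidate triple that survives pruning, {bread, milk, diaper}, occurs in just two transactions and is dropped.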
9. AP CO2 2
11. AP 2
Table: (1 Mark)
Element   Occurrences   1st moment   2nd moment   3rd moment
1         3             3            9            27
2         2             2            4            8
3         2             2            4            8
4         2             2            4            8
Sum                     9            21           51
First moment (length of the stream): 9
Second moment (surprise number): 21
Third moment: 51
(1 Mark)
[2 Marks]
24. AP CO2 2
[2 Marks]
26. AP CO5 2
[2 Marks]
27. How will the hybrid method help in evaluating the score value of itemsets? U CO2 2
Ans:
The initial sample has b baskets, c is the decay constant for the decaying window,
and the minimum score we wish to accept for a frequent itemset in the decaying
window is s. Then the support threshold for the initial run of the frequent-itemset
algorithm is bcs. If an itemset I is found to have support t in the sample, then it is
initially given a score of t/(bc). [2 Marks]
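The two formulas in this answer, the threshold bcs and the initial score t/(bc), can be illustrated with a short Python sketch. The function name `initial_scores` and all numeric values below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def initial_scores(sample_supports, b, c, s):
    """Keep itemsets whose sample support t reaches b*c*s; score them t/(b*c)."""
    threshold = b * c * s
    return {itemset: t / (b * c)
            for itemset, t in sample_supports.items()
            if t >= threshold}

# Hypothetical sample: supports t measured over b = 2000 baskets.
SAMPLE_SUPPORTS = {("bread", "milk"): 100, ("bread", "beer"): 30}

# With c = 0.01 and s = 2 the threshold is 2000 * 0.01 * 2 = 40.
scores = initial_scores(SAMPLE_SUPPORTS, b=2000, c=0.01, s=2)
print(scores)
```

Only the itemset with support 100 passes the threshold of 40, and it starts with a score of about 100/20 = 5.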
b) A database has five transactions. Let min sup = 60% and min conf = 75%. A CO5 8
Scan the database again to generate the frequent 3-itemsets.
The sets {E, K}, {K, O}, {E, O} make {E, K, O} possible. Likewise, {E,
O}, {E, Y}, {O, Y} make {E, O, Y} possible.
3 a) Draw and explain the architecture of general data stream management system A CO2 8
Answer:
Ans:
X1.element (2 marks)
At position 8 we find d, and so set X2.element = d and X2.value = 1. Positions 9
and 10 hold a and b, so they do not affect X1 or X2. Position 11 holds d so we set
X2.value = 2, and position 12 holds c so we set X1.value = 3.
X2.element (2 marks)
X3.element (2 marks)
Generate the association rules with a single item on the left-hand side and a single
item on the right-hand side. Find the rule that has the highest confidence.
(Step 1:1 Mark)
(Step 2:1 Mark)
(Step 3: 1 Mark)
(Step 4:1 Mark)
Association rule: (set 1,2,3,4) : 3 Marks
Strong Association rule:1 Mark
5 i) Consider the following sets as baskets, the words as items, and the support
threshold s = 3. Analyze the given baskets to design a market-basket model for
singleton, doubleton, and triple sets. AP CO5 8
1. {Cat, and, dog, bites}
2. {Yahoo, news, claims, a, cat, mated, with, a, dog, and, produced,
viable, offspring}
3. {Cat, killer, likely, is, a, big, dog}
4. {Professional, free, advice, on, dog, training, puppy, training}
5. {Cat, and, kitten, training, and, behavior}
6. {Dog, &, Cat, provides, dog, training, in, Eugene, Oregon}
7. {“Dog, and, cat”, is, a, slang, term, used, by, police, officers, for, a,
male–female, relationship}
8. {Shop, for, your, show, dog, grooming, and, pet, supplies}
Ans:
Since the empty set is a subset of any set, the support for ∅ is 8. However, we
shall not generally concern ourselves with the empty set, since it tells us
nothing. Among the singleton sets, obviously {cat} and {dog} are
quite frequent. “Dog” appears in all but basket (5), so its support is
7, while “cat” appears in all but (4) and (8), so its support is 6. The
word “and” is also quite frequent; it appears in (1), (2), (5), (7), and
(8), so its support is 5. The words “a” and “training” appear in three
sets, while “for” and “is” appear in two each. No other word
appears more than once. Suppose that we set our threshold at s = 3.
Then there are five frequent singleton itemsets: {dog}, {cat},
{and}, {a}, and {training}. Now, let us look at the doubletons. A
doubleton cannot be frequent unless both items in the set are
frequent by themselves. Thus, there are only ten possible frequent
doubletons. [4 Marks]
        training   a         and          cat
dog     4, 6       2, 3, 7   1, 2, 7, 8   1, 2, 3, 6, 7
cat     5, 6       2, 3, 7   1, 2, 5, 7
and     5          2, 7
a       none
For example, we see from the table of Fig. 6.2 that the doubleton {dog,
training} appears only in baskets (4) and (6). Therefore, its support
is 2, and it is not frequent. There are five frequent doubletons if s =
3; they are
{dog, a} {dog, and} {dog, cat} {cat, a} {cat, and}
Each appears at least three times; for instance, {dog, cat} appears
five times. Next, let us see if there are frequent triples. In order to
be a frequent triple, each pair of elements in the set must be a
frequent doubleton. For example, {dog, a, and} cannot be a
frequent itemset, because if it were, then surely {a, and} would be
frequent, but it is not. The triple {dog, cat, and} might be frequent,
because each of its doubleton subsets is frequent. Intersecting the
table entries for its three doubletons shows that the three words
appear together in baskets (1), (2), and (7), so it is a frequent
triple. The triple {dog, cat, a} might also be frequent, since its
doubletons are all frequent. In fact, all three words do appear in
baskets (2), (3), and (7), so it is a frequent triple as well.
No other triple of words is even a candidate for being a frequent
triple, since for no other triple of words are its three doubleton
subsets frequent. The only candidate quadruple is {dog, cat, and, a},
but its subset {a, and} is not a frequent doubleton, so there can be
no frequent quadruples or larger sets. [4 Marks]
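The supports quoted in this answer can be re-counted mechanically. The sketch below is a brute-force check rather than an efficient algorithm: it lower-cases the eight baskets, treats each word as an item, and counts every combination of frequent words. The helper names `support` and `frequent_itemsets` are our own, and "male-female" is kept as a single token, which is an assumption about how that phrase is itemized.

```python
from itertools import combinations

# The eight baskets from the question, lower-cased; each word is an item.
BASKETS = [set(text.split()) for text in [
    "cat and dog bites",
    "yahoo news claims a cat mated with a dog and produced viable offspring",
    "cat killer likely is a big dog",
    "professional free advice on dog training puppy training",
    "cat and kitten training and behavior",
    "dog & cat provides dog training in eugene oregon",
    "dog and cat is a slang term used by police officers for a male-female relationship",
    "shop for your show dog grooming and pet supplies",
]]

def support(itemset, baskets):
    """Number of baskets that contain every item of the itemset."""
    return sum(1 for basket in baskets if set(itemset) <= basket)

def frequent_itemsets(baskets, size, s):
    """Itemsets of the given size, built from frequent items, with support >= s."""
    items = sorted({w for b in baskets for w in b if support([w], baskets) >= s})
    return {combo: support(combo, baskets)
            for combo in combinations(items, size)
            if support(combo, baskets) >= s}

for size in (1, 2, 3, 4):
    print(size, frequent_itemsets(BASKETS, size, s=3))
```

For s = 3 this reports five frequent singletons, five frequent doubletons, the triples {a, cat, dog} and {and, cat, dog} with support 3 each, and no quadruples.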
ii) Enumerate the values and estimate the rules to divide the following bit stream
into buckets using the Datar-Gionis-Indyk-Motwani algorithm: AP CO2 8
. . 1 0 1 1 0 1 1 0 0 0 1 0 1 1 1 0 1 1 0 0 1 0 1 1 0
Ans:
This is the simplest case of the algorithm called DGIM. This version of the algorithm
uses O(log² N) bits to represent a window of N bits, and allows us to estimate
the number of 1’s in the window with an error of no more than 50%. [4 Marks]
[4 Marks]
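A possible way to trace the bucket construction on the given bits is the Python sketch below. It rests on several assumptions that are ours, not the question paper's: the bits are read left to right as arrival order, the window length N is taken to be 25 so that no bucket expires, timestamps are kept as plain integers rather than modulo N, and the function names are hypothetical.

```python
def dgim_buckets(bits, window):
    """Return DGIM buckets as (end_timestamp, size), newest bucket first."""
    buckets = []
    for t, bit in enumerate(bits, start=1):
        # Drop the oldest bucket once its right end leaves the window.
        if buckets and buckets[-1][0] <= t - window:
            buckets.pop()
        if bit != 1:
            continue
        buckets.insert(0, (t, 1))  # every new 1 starts a bucket of size 1
        i = 0
        # Three buckets of one size: merge the two oldest, then cascade upward.
        while i + 2 < len(buckets) and buckets[i][1] == buckets[i + 2][1]:
            merged = (buckets[i + 1][0], buckets[i + 1][1] * 2)
            buckets[i + 1:i + 3] = [merged]
            i += 1
    return buckets

def estimate_ones(buckets):
    """Whole-window estimate: all newer buckets plus half of the oldest one."""
    sizes = [size for _, size in buckets]
    return sum(sizes[:-1]) + sizes[-1] / 2 if sizes else 0

BITS = [1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1,
        0, 1, 1, 0, 0, 1, 0, 1, 1, 0]
buckets = dgim_buckets(BITS, window=25)
print([size for _, size in buckets])  # [1, 1, 2, 2, 4, 4]
print(estimate_ones(buckets))         # 12.0
```

The stream contains 14 ones, so under these assumptions the estimate of 12 is well inside the 50% error bound.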
6 i) Consider a generalization of the problem of counting distinct elements in a AP CO2 16
stream. The problem is called computing “moments.” Find and formulate the
probability s/n for the second moment using the Alon-Matias-Szegedy algorithm.
Ans:
Estimate the second moment of the stream using a limited amount of
space; the more space we use, the more accurate the estimate will be. We
compute some number of variables. For each variable X, we store:
1. A particular element of the universal set, which we refer to as X.element, and
2. An integer X.value, which is the value of the variable.
To determine the value of a variable X, we choose a position in the stream
between 1 and n, uniformly and at random. Set X.element to be the element found
there, and initialize X.value to 1. As we read the stream, add 1 to X.value each
time we encounter another occurrence of X.element.
When we reach position 3, we find element c, so we set X1.element =
c and X1.value = 1. Position 4 holds b, so we do not change X1.
Likewise, nothing happens at positions 5 or 6. At position 7, we see c
again, so we set X1.value = 2. At position 8 we find d, and so set
X2.element = d and X2.value = 1. Positions 9 and 10 hold a and b, so they
do not affect X1 or X2. Position 11 holds d so we set X2.value = 2, and
position 12 holds c so we set X1.value = 3. At position 13, we find
element a, and so set X3.element = a and X3.value = 1. Then, at position
14 we see another a and so set X3.value = 2. Position 15, with element b
does not affect any of the variables, so we are done, with final values
X1.value = 3 and X2.value = X3.value = 2. [8 Marks]
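We can derive an estimate of the second moment from
any variable X. This estimate is n(2 · X.value − 1).
Here the stream ends at position 15, so n = 15. X1 gives 15(2 · 3 − 1) = 75,
while X2 and X3 each give 15(2 · 2 − 1) = 45. The average of the three
estimates is (75 + 45 + 45)/3 = 55.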
[8 Marks]
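The procedure can also be run as a short Python sketch. The 15-element stream below is an assumption: the answer does not list the stream, so one was chosen to agree with every position mentioned in the walkthrough (c at 3, 7, 12; d at 8, 11; a at 9, 13, 14; b at 4, 10, 15). The positions 3, 8, and 13 are fixed rather than drawn at random so the result matches the worked example, and the function name `ams_estimates` is our own.

```python
from collections import Counter

def ams_estimates(stream, positions):
    """One AMS second-moment estimate, n * (2 * X.value - 1), per start position."""
    n = len(stream)
    estimates = []
    for p in positions:
        element = stream[p - 1]                 # X.element (positions are 1-based)
        value = stream[p - 1:].count(element)   # X.value: occurrences from p onward
        estimates.append(n * (2 * value - 1))
    return estimates

# Assumed stream, consistent with the positions named in the answer.
STREAM = list("abcbdacdabdcaab")

estimates = ams_estimates(STREAM, positions=[3, 8, 13])
average = sum(estimates) / len(estimates)
true_second_moment = sum(c ** 2 for c in Counter(STREAM).values())
print(estimates, average, true_second_moment)  # [75, 45, 45] 55.0 59
```

For this assumed stream the exact second moment is 59, so the average of the three estimates, 55, is reasonably close; using more variables would tighten it further.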