
COMPUTER SCIENCE AND ENGINEERING

17CS318 DATA ANALYTICS


CONTINUOUS INTERNAL ASSESSMENT -III KEY
Date : 20.03.2020 Semester : VI
Duration : 1.30 Hours Max. Marks : 50 Marks

COURSE OUTCOMES
CO1 Interpret big data analytics frameworks such as the Hadoop ecosystem and Spark architecture, and apply them to
specific case studies
CO2 Use real-time analytical methods on streaming datasets to react quickly to customer needs
CO3 Analyze and develop transferable skills needed to create and architect big data systems
CO4 Describe a wide range of big data tools and techniques
CO5 Analyze big data problems by identifying key requirements, alternative solutions and evaluation methods

ANSWER ALL QUESTIONS


PART A (9 x 2 Marks = 18 Marks) BT CO MARKS
1. Find the Euclidean distance between the two data points A(1, 2) and B(2, 3). AP CO5 2

Distance between A and B = sqrt{(2 - 1)^2 + (3 - 2)^2} - 1 mark

Distance between A and B = sqrt{1 + 1} = sqrt{2}

Distance between A and B = sqrt{2} ≈ 1.414 - 1 mark

2. List the rules that must be followed when representing a stream by buckets. U CO2 2
(6 rules - 2 marks)
1. The right end of a bucket is always a position with a 1.
2. Every position with a 1 is in some bucket.
3. No position is in more than one bucket.
4. There are one or two buckets of any given size, up to some maximum size.
5. All bucket sizes must be a power of 2.
6. Buckets cannot decrease in size as we move to the left (earlier in time).

3. Specify the importance of analysis of Bloom filtering in mining data streams. AP CO2 2
A Bloom filter allows through all stream elements whose keys belong to a chosen set S,
while rejecting most of the stream elements whose keys are not in S, using only a
bit-array and a collection of hash functions that fit in main memory.
4. Specify the need of association mining in frequent itemset mining. R CO2 2
Frequent itemset mining is the generation of association rules from a transactional
dataset. If two items X and Y are purchased together frequently, then it is good to
put them together in stores, or to provide a discount offer on one item on purchase
of the other; this can really increase sales. For example, it is likely to find that
if a customer buys milk and bread, he/she also buys butter.
So the association rule is {milk, bread} => {butter}, and the seller can suggest
that the customer buy butter if he/she buys milk and bread.
5. Prove by induction on m that 1 + 3 + 5 + · · · + (2m - 1) = m^2. A CO5 2
Basis: for m = 1 the sum is 1 = 1^2.
Induction: assume 1 + 3 + · · · + (2m - 1) = m^2. Adding the next odd number,
2(m + 1) - 1 = 2m + 1, gives m^2 + 2m + 1 = (m + 1)^2, so the identity holds for m + 1.

6. Compute the surprise number (second moment) for the stream 3, 1, 4, 1, 3, 4, 2, 1, 2. AP CO2 2
What is the third moment of this stream?

Table: (1 Mark)
Element | Occurrences | 1st moment | 2nd moment | 3rd moment
   1    |      3      |     3      |     9      |     27
   2    |      2      |     2      |     4      |      8
   3    |      2      |     2      |     4      |      8
   4    |      2      |     2      |     4      |      8
 Total  |             |   = 9      |   = 21     |   = 51
First moment (the length of the stream): 9
Second moment (the surprise number): 21
Third moment: 51 (1 Mark)
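These moments can be cross-checked with a short Python sketch (illustrative only, not part of the key):

from collections import Counter

stream = [3, 1, 4, 1, 3, 4, 2, 1, 2]
counts = Counter(stream)  # 1 appears 3 times; 2, 3 and 4 appear twice each

# The k-th moment is the sum over distinct elements of (occurrence count)^k.
for k in (1, 2, 3):
    print(k, sum(c ** k for c in counts.values()))  # prints 9, 21, 51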

7. Let X, Y be two itemsets, and let supp(X) denote the support of itemset X. A CO3 2
Then the confidence of the rule X -> Y is denoted by conf(X -> Y). Write the
formula for conf(X -> Y).

conf(X -> Y) = supp(X ∪ Y) / supp(X)
8. How does the Apriori algorithm work to mine frequent itemsets and learn U CO4 2
association rules over databases?
Scan the transaction database to get the support S of each 1-itemset, compare S
with min_sup, and obtain the set of frequent 1-itemsets. Use the join step to
generate a set of candidate k-itemsets, and use the Apriori property to prune the
infrequent k-itemsets from this set.

9. Give an outline of the Limited Pass algorithm. AP CO2 2
Limited-pass algorithms trade exactness for speed: instead of one pass per itemset
size, they use at most two passes over the data and may miss some frequent itemsets.

10. Prove by induction on m that 1 + 3 + 5 + · · · + (2m - 1) = m^2. A CO5 2

The expected value of n(2X.value - 1) is the average over all positions i between
1 and n of n(2c(i) - 1), where c(i) is the number of times the element at position
i appears from position i onward:
E[n(2X.value - 1)] = (1/n) Σ_{i=1}^{n} n(2c(i) - 1) = Σ_{i} (2c(i) - 1).
For an element a occurring m_a times, its positions contribute
1 + 3 + 5 + · · · + (2m_a - 1) = m_a^2 by the identity above, so the expectation
equals Σ_a m_a^2, the second moment. (2 Marks)
11. Compute the surprise number (second moment) for the stream 3, 1, 4, 1, 3, 4, 2, 1, 2. AP CO2 2
What is the third moment of this stream?

Table: (1 Mark)
Element | Occurrences | 1st moment | 2nd moment | 3rd moment
   1    |      3      |     3      |     9      |     27
   2    |      2      |     2      |     4      |      8
   3    |      2      |     2      |     4      |      8
   4    |      2      |     2      |     4      |      8
 Total  |             |   = 9      |   = 21     |   = 51
First moment (the length of the stream): 9
Second moment (the surprise number): 21
Third moment: 51 (1 Mark)

12. How does market basket analysis help in business analytics? A CO3 2

Description: market basket analysis examines customer purchasing behaviour to find
groups of items that are bought together, which helps to increase sales and manage
inventory. (1 Mark)
For example, IF {Beer, Meat} THEN {Chips}, or IF {Diaper, Onion} THEN {Beer}. (1 Mark)

13. What is the purpose of limited pass algorithms? U CO4 2

Many algorithms (e.g., A-Priori, PCY, Multistage, Multihash) compute the exact
collection of frequent itemsets of size k in k passes: they use one pass for each
size of itemset investigated. If main memory is too small to hold both the data
and the space needed to count frequent itemsets of one size, there does not seem
to be any way to avoid k passes to compute the exact collection of frequent
itemsets. However, there are many applications where it is not essential to
discover every frequent itemset, and limited-pass algorithms serve these. (2 Marks)
14. Analyze the role of Bloom filters in selecting data streams. AP CO2 2
A Bloom filter consists of:
1. An array of n bits, initially all 0's.
2. A collection of hash functions h1, h2, . . . , hk. Each hash function maps
"key" values to n buckets, corresponding to the n bits of the bit-array.
3. A set S of m key values.
The purpose of the Bloom filter is to allow through all stream elements whose keys
are in S, while rejecting most of the stream elements whose keys are not in S. (2 Marks)
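A minimal Python sketch of this structure, with the bit-array size, the number of hash functions, and the salted SHA-256 hashing all chosen purely for illustration (the question does not prescribe them):

import hashlib

class BloomFilter:
    def __init__(self, n_bits, num_hashes):
        self.n = n_bits
        self.k = num_hashes
        self.bits = [0] * n_bits  # array of n bits, initially all 0's

    def _hashes(self, key):
        # Derive k bucket indices from salted SHA-256 digests (illustrative choice).
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.n

    def add(self, key):  # record a member of the set S
        for idx in self._hashes(key):
            self.bits[idx] = 1

    def might_contain(self, key):  # no false negatives; false positives possible
        return all(self.bits[idx] for idx in self._hashes(key))

bf = BloomFilter(n_bits=1024, num_hashes=3)
bf.add("alice@example.com")
print(bf.might_contain("alice@example.com"))    # True
print(bf.might_contain("mallory@example.com"))  # False, with high probability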
15. The most even distribution of these eleven elements would have one appearing A CO3 2
10 times and the other ten appearing 9 times each. Find the surprise number.
Ans: 10^2 + 10 × 9^2 = 100 + 810 = 910 (2 Marks)
16. Formulate the applications of frequent itemsets. A CO2 2
● Related concepts
● Plagiarism
● Biomarkers (2 Marks)
17. What is the main idea of estimating moments? A CO3 2
Computing "moments" involves the distribution of frequencies of different elements
in the stream. We define moments of all orders and concentrate on computing second
moments, from which the general algorithm for all moments is a simple extension. (2 Marks)

18. What is an association rule? List the two interesting measures of an U CO5 2
association rule.
Association rules are created by searching data for frequent if-then patterns and
using the two criteria, support and confidence, to identify the most important
relationships. (2 Marks)

19. Illustrate the set of candidate pairs C2 to be those pairs {i, j} such that: U CO2 2

Ans:
1. i and j are frequent items.
2. {i, j} hashes to a frequent bucket. [2 Marks]
20. Suppose the stream is a, b, c, b, d, a, c, d, a, b, d, c, a, a, b. The length AP CO5 2
of the stream is n = 15. Find the second moment and evaluate n(2X1.value - 1).
Ans:
5^2 + 4^2 + 3^2 + 3^2 = 59
n(2X1.value - 1) = 15 × (2 × 3 - 1) = 75. [2 Marks]
21. Illustrate the two different conditions on the probability of tail length AP CO2 2
used to justify the estimate 2^R.
Ans:
1. If m is much larger than 2^r, then the probability that we shall find a tail of
length at least r approaches 1. [1 Mark]
2. If m is much less than 2^r, then the probability of finding a tail of length at
least r approaches 0. [1 Mark]
22. How does the DGIM algorithm count the 1's in the last k bits of a window? U CO5 2
Ans:
Let the current timestamp be t. Then the two buckets of size 1, having timestamps
t - 1 and t - 2, are completely included in the answer. The bucket of size 2, with
timestamp t - 4, is also completely included. [2 Marks]
23. Formulate the fractional error of the DGIM algorithm when maintaining the AP CO2 2
condition of the window.
Ans:
The fractional error is no more than 50%: the estimate differs from the true count
by at most half the size of the largest bucket that partially overlaps the query
range, while the true count is at least that amount. [2 Marks]

24. Sketch the decaying window and a fixed-length window of equal weight. AP CO2 2

(Figure: a decaying window compared with a fixed-length window of equal weight.)
[2 Marks]

25. Describe the mechanics of the second pass of the A-Priori algorithm. U CO2 2

Ans:
1. For each basket, look in the frequent-items table to see which of its items are
frequent.
2. In a double loop, generate all pairs of frequent items in that basket.
3. For each such pair, add one to its count in the data structure used to store
counts. [2 Marks]
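The three steps can be sketched in Python as follows; the baskets and the frequent-items set here are hypothetical placeholders, not data from the question:

from itertools import combinations
from collections import defaultdict

# Hypothetical inputs: the baskets and the items found frequent on pass 1.
baskets = [{"milk", "bread", "butter"}, {"milk", "bread"}, {"bread", "butter"}]
frequent_items = {"milk", "bread", "butter"}

pair_counts = defaultdict(int)
for basket in baskets:
    # Step 1: keep only the items of this basket that are frequent.
    freq_in_basket = sorted(basket & frequent_items)
    # Step 2: the double loop generating all pairs of frequent items.
    for pair in combinations(freq_in_basket, 2):
        # Step 3: add one to this pair's count.
        pair_counts[pair] += 1

print(dict(pair_counts))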
26. Illustrate how Bloom filtering is applied to the analysis of the given target. AP CO5 2
Ans:
(Diagram: the working of the Bloom filter for the given target.) [2 Marks]
27. How does the hybrid method help in evaluating the score value of itemsets? U CO2 2
Ans:
The initial sample has b baskets, c is the decay constant for the decaying window,
and the minimum score we wish to accept for a frequent itemset in the decaying
window is s. Then the support threshold for the initial run of the frequent-itemset
algorithm is bcs. If an itemset I is found to have support t in the sample, then it
is initially given a score of t/(bc). [2 Marks]

PART B (2 X 16 Marks = 32 Marks) BT CO MARKS


1 a) Evaluate the market basket data and its use in main memory. AP CO2 8
Answer :
The Market Basket Model :
Market Basket Analysis (Association Analysis) is a mathematical
modeling technique based upon the theory that if you buy a certain
group of items, you are likely to buy another group of items.
It is used to analyze customer purchasing behaviour, and helps in increasing sales
and maintaining inventory by focusing on point-of-sale transaction data.
● A large set of items, e.g., the things sold in a supermarket.
● A large set of baskets, each of which is a small set of the items, e.g., the
things one customer buys on one day.

In many algorithms to find frequent itemsets we need to worry about how main
memory is used. Typically, the data is kept in a "flat file" rather than a
database system, stored on disk, basket-by-basket; baskets are expanded into
pairs, triples, etc. as they are read.
The true cost of mining disk-resident data is usually the number of disk I/O's.
In practice, association-rule algorithms read the data in passes --- all baskets
are read in turn. Thus, we measure the cost by the number of passes an algorithm
takes.
● As we read baskets we need to count something, e.g., occurrences of pairs.
● The number of different things we can count is limited by main memory.
● Swapping counts in/out is a disaster.
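One standard way to keep pair counts compact in main memory is a one-dimensional triangular matrix; the sketch below uses the usual layout formula, with the item count n chosen only for illustration:

def pair_index(i, j, n):
    # 1-based index of pair {i, j} with 1 <= i < j <= n in a 1-D triangular
    # array over n items; standard layout: k = (i - 1)(n - i/2) + j - i.
    assert 1 <= i < j <= n
    return int((i - 1) * (n - i / 2) + j - i)

n = 5  # hypothetical number of items, for illustration only
counts = [0] * (n * (n - 1) // 2 + 1)  # slot 0 unused; one count per pair
counts[pair_index(2, 4, n)] += 1       # record one occurrence of the pair {2, 4}
print(pair_index(2, 4, n), counts[pair_index(2, 4, n)])  # 6 1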
b) Write notes on the following - CO2 8
Counting ones in a window - 8 marks
Answer:
Explanation of the DGIM algorithm:
The algorithm is designed to estimate the number of 1's in a window of a bit
stream. It uses O(log² N) bits to represent a window of N bits, and allows the
number of 1's in the window to be estimated with an error of no more than 50%,
i.e., the answer is within 50% of the true count.
Storage requirements:
Each bucket can be represented by O(log N) bits. If the window has length N, then
there are no more than N 1's, surely. Suppose the largest bucket is of size 2^j.
Then j cannot exceed log₂ N, or else there are more 1's in this bucket than there
are 1's in the entire window. Thus, there are at most two buckets of all sizes
from log₂ N down to 1, and no buckets of larger sizes. We conclude that there are
O(log N) buckets. Since each bucket can be represented in O(log N) bits, the total
space required for all the buckets representing a window of size N is O(log² N).
Query answering:
Suppose we are asked how many 1's there are in the last k bits of the window, for
some 1 ≤ k ≤ N. Find the bucket b with the earliest timestamp that includes at
least some of the k most recent bits. Estimate the number of 1's to be the sum of
the sizes of all the buckets more recent than bucket b, plus half the size of b
itself.

Maintaining the DGIM condition:
When a new bit comes in, we may need to modify the buckets, so they continue to
represent the window and continue to satisfy the DGIM conditions. First, whenever
a new bit enters:
● Check the leftmost (earliest) bucket. If its timestamp has now reached the
current timestamp minus N, then this bucket no longer has any of its 1's in the
window. Therefore, drop it from the list of buckets.
Now, we must consider whether the new bit is 0 or 1. If it is 0, then no further
change to the buckets is needed. If the new bit is a 1, however, we may need to
make several changes. First:
● Create a new bucket with the current timestamp and size 1. If there was only
one bucket of size 1, then nothing more needs to be done. However, if there are
now three buckets of size 1, that is one too many. We fix this problem by
combining the leftmost (earliest) two buckets of size 1.
● To combine any two adjacent buckets of the same size, replace them by one
bucket of twice the size. The timestamp of the new bucket is the timestamp of the
rightmost (later in time) of the two buckets.
Combining two buckets of size 1 may create a third bucket of size 2. If so, we
combine the leftmost two buckets of size 2 into a bucket of size 4. That, in turn,
may create a third bucket of size 4, and if so we combine the leftmost two into a
bucket of size 8. This process may ripple through the bucket sizes, but there are
at most log₂ N different sizes, and the combination of two adjacent buckets of the
same size only requires constant time. As a result, any new bit can be processed
in O(log N) time.
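A minimal Python sketch of this update step, representing each bucket as a (timestamp, size) pair with the newest bucket first; the window length N and the input bits are illustrative assumptions:

def dgim_update(buckets, t, bit, N):
    # Process one arriving bit at timestamp t. buckets is a list of
    # (timestamp, size) pairs, newest first; sizes are powers of 2, with at
    # most two buckets of each size (the DGIM condition).
    if buckets and buckets[-1][0] <= t - N:
        buckets.pop()  # the earliest bucket has left the window: drop it
    if bit == 0:
        return buckets  # a 0 never changes the buckets
    buckets.insert(0, (t, 1))  # a 1 starts a new bucket of size 1
    # Cascade: while three buckets share a size, merge the earliest two into
    # one bucket of twice the size, keeping the later of their two timestamps.
    size = 1
    while sum(1 for _, s in buckets if s == size) > 2:
        idx = [i for i, (_, s) in enumerate(buckets) if s == size]
        i1, i2 = idx[-2], idx[-1]  # the two earliest buckets of this size
        buckets[i1] = (buckets[i1][0], size * 2)  # keep the later timestamp
        del buckets[i2]
        size *= 2
    return buckets

buckets = []
for t, bit in enumerate([1, 0, 1, 1, 1, 0, 1], start=1):
    buckets = dgim_update(buckets, t, bit, N=16)
print(buckets)  # [(7, 1), (5, 2), (3, 2)]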
Reducing error:
Instead of allowing either one or two of each size of bucket, suppose we allow
either r - 1 or r of each of the exponentially growing sizes 1, 2, 4, . . ., for
some integer r > 2. In order to represent any possible number of 1's, we must
relax this condition for the buckets of the largest size present; there may be any
number of these, from 1 to r. If we get r + 1 buckets of size 2^j, we combine the
leftmost two into a bucket of size 2^(j+1). That may, in turn, cause there to be
r + 1 buckets of size 2^(j+1), and if so we continue combining buckets of larger
sizes.
However, because there are more buckets of smaller sizes, we can get a stronger
bound on the error. The largest relative error occurs when only one 1 from the
leftmost bucket b is within the query range, and we therefore overestimate the
true count. Suppose bucket b is of size 2^j. Then the true count is at least
1 + (r - 1)(2^(j-1) + 2^(j-2) + · · · + 1) = 1 + (r - 1)(2^j - 1). The
overestimate is 2^(j-1) - 1. Thus, the fractional error is
(2^(j-1) - 1) / (1 + (r - 1)(2^j - 1)).
2 a) Explain the Flajolet-Martin algorithm and the Alon-Matias-Szegedy AP CO2 8
algorithm for handling moments in data streams.
Answer:
FM:
The idea behind the Flajolet-Martin Algorithm is that the more different elements
we see in the stream, the more different hash-values we shall see. As we see more
different hash-values, it becomes more likely that one of these values will be
"unusual." The particular unusual property we shall exploit is that the value ends
in many 0's, although many other options exist. Whenever we apply a hash function
h to a stream element a, the bit string h(a) will end in some number of 0's,
possibly none. Call this number the tail length for a and h. Let R be the maximum
tail length of any a seen so far in the stream. Then we shall use the estimate 2^R
for the number of distinct elements seen in the stream.
This estimate makes intuitive sense. The probability that a given stream element a
has h(a) ending in at least r 0's is 2^(-r). Suppose there are m distinct elements
in the stream. Then the probability that none of them has tail length at least r
is (1 - 2^(-r))^m. We can rewrite this as ((1 - 2^(-r))^(2^r))^(m 2^(-r)).
Assuming r is reasonably large, the inner expression is of the form (1 - ε)^(1/ε),
which is approximately 1/e. Thus, the probability of not finding a stream element
with as many as r 0's at the end of its hash value is e^(-m 2^(-r)). We can
conclude:
1. If m is much larger than 2^r, then the probability that we shall find a tail of
length at least r approaches 1.
2. If m is much less than 2^r, then the probability of finding a tail of length at
least r approaches 0.
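A minimal Python sketch of the FM estimate, using truncated SHA-256 as an illustrative hash function (the algorithm itself does not prescribe one):

import hashlib

def tail_length(x):
    # Number of trailing 0 bits of x; treat 0 as tail length 0 for simplicity.
    return 0 if x == 0 else (x & -x).bit_length() - 1

def fm_estimate(stream):
    # Flajolet-Martin estimate 2**R, where R is the maximum tail length of
    # h(a) over the stream; truncated SHA-256 stands in for the hash function.
    R = 0
    for a in stream:
        h = int(hashlib.sha256(str(a).encode()).hexdigest(), 16) & 0xFFFFFFFF
        R = max(R, tail_length(h))
    return 2 ** R

print(fm_estimate(["a", "b", "c", "b", "d", "a", "c", "d", "a"]))  # a power of 2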
AMS:
For now, let us assume that a stream has a particular length n. Suppose we do not
have enough space to count all the m_i's for all the elements of the stream. We
can still estimate the second moment of the stream using a limited amount of
space; the more space we use, the more accurate the estimate will be. We compute
some number of variables. For each variable X, we store:
1. A particular element of the universal set, which we refer to as X.element, and
2. An integer X.value, which is the value of the variable.
To determine the value of a variable X, we choose a position in the stream between
1 and n, uniformly and at random. Set X.element to be the element found there, and
initialize X.value to 1. As we read the stream, add 1 to X.value each time we
encounter another occurrence of X.element.

b) A database has five transactions. Let min_sup = 60% and min_conf = 75%. A CO5 8
Find all frequent itemsets using the Apriori method.

The database is scanned once to generate frequent 1-itemsets. To do this, absolute
support is used, where duplicate values are counted only once per TID. The total
number of TIDs is 5, so a minimum support of 60% is equivalent to an absolute
support count of 3 out of 5. Thus itemsets with support counts of 1 or 2 are
eliminated.

Now, the database is scanned a second time to generate frequent 2-itemsets. The
possible combinations of the five frequent items are 5!/(3!2!) = 10. Using
absolute support, each combination is counted per TID, and combinations that are
below the support count of 3 are eliminated.

The database is scanned again to generate frequent 3-itemsets. Sets {E, K},
{K, O}, {E, O} make {E, K, O} possible. Likewise, {E, O}, {E, Y}, {O, Y} make
{E, O, Y} possible.

Frequent 4-itemsets cannot be generated, because sets {K, O, Y} and {E, K, Y} are
missing. So, all frequent itemsets have been found.
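Since the key does not reproduce the transaction table itself, the following Python sketch shows the Apriori join-and-prune loop on a hypothetical five-transaction database (the data below is an illustrative assumption, not the question's table):

from itertools import combinations

def apriori(transactions, min_sup):
    # Return all frequent itemsets (as frozensets) with absolute support >= min_sup.
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {i for t in transactions for i in t}
    frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_sup}]
    k = 2
    while frequent[-1]:
        prev = frequent[-1]
        # Join step: unions of frequent (k-1)-itemsets that yield k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        frequent.append({c for c in candidates if support(c) >= min_sup})
        k += 1
    return [s for level in frequent for s in level]

# Hypothetical database, chosen only to demonstrate the algorithm.
db = [{"M", "O", "N", "K", "E", "Y"}, {"D", "O", "N", "K", "E", "Y"},
      {"M", "A", "K", "E"}, {"M", "U", "C", "K", "Y"}, {"C", "O", "K", "I", "E"}]
print(sorted(apriori(db, min_sup=3), key=len))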

3 a) Draw and explain the architecture of a general data stream management system. A CO2 8
Answer:

In analogy to a database-management system, we can view a stream processor as a
kind of data-management system, the high-level organization of which is suggested
in Fig. 4.1. Any number of streams can enter the system. Each stream can provide
elements at its own schedule; they need not have the same data rates or data
types, and the time between elements of one stream need not be uniform. The fact
that the rate of arrival of stream elements is not under the control of the system
distinguishes stream processing from the processing of data that goes on within a
database-management system. The latter system controls the rate at which data is
read from the disk, and therefore never has to worry about data getting lost as it
attempts to execute queries.

Streams may be archived in a large archival store, but we assume it is not
possible to answer queries from the archival store. It could be examined only
under special circumstances using time-consuming retrieval processes. There is
also a working store, into which summaries or parts of streams may be placed, and
which can be used for answering queries. The working store might be disk, or it
might be main memory, depending on how fast we need to process queries. But either
way, it is of sufficiently limited capacity that it cannot store all the data from
all the streams.
b) Suppose the stream is a, b, c, b, d, a, c, d, a, b, d, c, a, a, b. The length AP CO2 8
of the stream is n = 15. Since a appears 5 times, b appears 4 times, and c and d
appear three times each, find the second moment for the stream. Also assume that
at "random" we pick the 3rd, 8th, and 13th positions to define three variables,
and find the average of the three estimates.

Ans:

Second moment = 5^2 + 4^2 + 3^2 + 3^2 = 59. (1 Mark)

When we reach position 3, we find element c, so we set X1.element = c and
X1.value = 1. Position 4 holds b, so we do not change X1. Likewise, nothing
happens at positions 5 or 6. At position 7, we see c again, so we set
X1.value = 2. (X1.element: 2 Marks)

At position 8 we find d, and so set X2.element = d and X2.value = 1. Positions 9
and 10 hold a and b, so they do not affect X1 or X2. Position 11 holds d so we
set X2.value = 2, and position 12 holds c so we set X1.value = 3.
(X2.element: 2 Marks)

At position 13, we find element a, and so set X3.element = a and X3.value = 1.
Then, at position 14 we see another a and so set X3.value = 2. Position 15, with
element b, does not affect any of the variables, so we are done, with final values
X1.value = 3 and X2.value = X3.value = 2. (X3.element: 2 Marks)

From X1 we derive the estimate n(2X1.value - 1) = 15 × (2 × 3 - 1) = 75. The
other two variables, X2 and X3, each have value 2 at the end, so their estimates
are 15 × (2 × 2 - 1) = 45. The average of the three estimates is 55, a fairly
close approximation. (1 Mark)
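The worked example can be verified with a short Python sketch over the same stream and the same three positions:

stream = list("abcbdacdabdcaab")  # the 15-element stream above
n = len(stream)

true_second_moment = sum(stream.count(x) ** 2 for x in set(stream))  # 59

def ams_estimate(position):
    # AMS estimate n * (2 * X.value - 1) for the variable defined at a
    # 1-based position; X.value counts occurrences from that position onward.
    element = stream[position - 1]
    value = stream[position - 1:].count(element)
    return n * (2 * value - 1)

estimates = [ams_estimate(p) for p in (3, 8, 13)]
print(true_second_moment, estimates, sum(estimates) / len(estimates))
# 59 [75, 45, 45] 55.0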

4 a) Explain the Flajolet-Martin algorithm and the Alon-Matias-Szegedy algorithm AP CO2 8
for handling moments in data streams.
Answer:
FM:
The idea behind the Flajolet-Martin Algorithm is that the more different elements
we see in the stream, the more different hash-values we shall see. As we see more
different hash-values, it becomes more likely that one of these values will be
"unusual." The particular unusual property we shall exploit is that the value ends
in many 0's, although many other options exist. Whenever we apply a hash function
h to a stream element a, the bit string h(a) will end in some number of 0's,
possibly none. Call this number the tail length for a and h. Let R be the maximum
tail length of any a seen so far in the stream. Then we shall use the estimate 2^R
for the number of distinct elements seen in the stream.
This estimate makes intuitive sense. The probability that a given stream element a
has h(a) ending in at least r 0's is 2^(-r). Suppose there are m distinct elements
in the stream. Then the probability that none of them has tail length at least r
is (1 - 2^(-r))^m. We can rewrite this as ((1 - 2^(-r))^(2^r))^(m 2^(-r)).
Assuming r is reasonably large, the inner expression is of the form (1 - ε)^(1/ε),
which is approximately 1/e. Thus, the probability of not finding a stream element
with as many as r 0's at the end of its hash value is e^(-m 2^(-r)). We can
conclude:
1. If m is much larger than 2^r, then the probability that we shall find a tail of
length at least r approaches 1.
2. If m is much less than 2^r, then the probability of finding a tail of length at
least r approaches 0.
AMS:
For now, let us assume that a stream has a particular length n. Suppose we do not
have enough space to count all the m_i's for all the elements of the stream. We
can still estimate the second moment of the stream using a limited amount of
space; the more space we use, the more accurate the estimate will be. We compute
some number of variables. For each variable X, we store:
1. A particular element of the universal set, which we refer to as X.element, and
2. An integer X.value, which is the value of the variable.
To determine the value of a variable X, we choose a position in the stream between
1 and n, uniformly and at random. Set X.element to be the element found there, and
initialize X.value to 1. As we read the stream, add 1 to X.value each time we
encounter another occurrence of X.element.
b) Apply the Apriori algorithm on the following set of transactions with A CO5 8
min_sup = 3 and min_conf = 80%.

TID | Item Sets
T1  | A, B, C, D, E
T11 | B, C, D
T21 | A, B, D, E
T31 | A, C, E, D, B
T41 | C, B, D, E
T51 | D, B, E
T61 | D, C
T71 | B, A, C
T81 | D, E, A
T91 | D, B

Generate the association rules with a single item on the left-hand and right-hand
sides of the association rule. Compute the rule that has the highest confidence.
(Step 1: 1 Mark)
(Step 2: 1 Mark)
(Step 3: 1 Mark)
(Step 4: 1 Mark)
Association rules (sets 1, 2, 3, 4): 3 Marks
Strong association rule: 1 Mark

5 i) Consider each of the following sets as a basket and the words as items, with AP CO5 8
support threshold s = 3. Analyze the given baskets to design a market basket model
for singleton, doubleton, and triple itemsets.
1.{Cat, and, dog, bites}
2. {Yahoo, news, claims, a, cat, mated, with, a, dog, and, produced,
viable, offspring}
3. {Cat, killer, likely, is, a, big, dog}
4. {Professional, free, advice, on, dog, training, puppy, training}
5. {Cat, and, kitten, training, and, behavior}
6. {Dog, &, Cat, provides, dog, training, in, Eugene, Oregon}
7. {“Dog, and, cat”, is, a, slang, term, used, by, police, officers, for, a,
male– female, relationship}
8. {Shop, for, your, show, dog, grooming, and, pet, supplies}
Ans:
Since the empty set is a subset of any set, the support for ∅ is 8. However, we
shall not generally concern ourselves with the empty set, since it tells us
nothing. Among the singleton sets, obviously {cat} and {dog} are quite frequent.
"Dog" appears in all but basket (5), so its support is 7, while "cat" appears in
all but (4) and (8), so its support is 6. The word "and" is also quite frequent;
it appears in (1), (2), (5), (7), and (8), so its support is 5. The words "a" and
"training" appear in three sets, while "for" and "is" appear in two each. No other
word appears more than once. With the threshold at s = 3, there are five frequent
singleton itemsets: {dog}, {cat}, {and}, {a}, and {training}.
Now, let us look at the doubletons. A doubleton cannot be frequent unless both
items in the set are frequent by themselves. Thus, there are only ten possible
frequent doubletons. [4 Marks]
The baskets in which each doubleton occurs (the table of Fig. 6.2):

        training   a          and           cat
dog     4, 6       2, 3, 7    1, 2, 7, 8    1, 2, 3, 6, 7
cat     5, 6       2, 3, 7    1, 2, 5, 7
and     5          2, 7
a       none

For example, we see from the table that the doubleton {dog, training} appears only
in baskets (4) and (6). Therefore, its support is 2, and it is not frequent. There
are five frequent doubletons if s = 3; they are
{dog, a} {dog, and} {dog, cat} {cat, a} {cat, and}
Each appears at least three times; for instance, {dog, cat} appears five times.
Next, let us see if there are frequent triples. In order to be a frequent triple,
each pair of elements in the set must be a frequent doubleton. For example,
{dog, a, and} cannot be a frequent itemset, because if it were, then surely
{a, and} would be frequent, but it is not. The triple {dog, cat, and} might be
frequent, because each of its doubleton subsets is frequent. Unfortunately, the
three words appear together only in baskets (1) and (2), so it is not a frequent
triple. The triple {dog, cat, a} might be frequent, since its doubletons are all
frequent. In fact, all three words do appear in baskets (2), (3), and (7), so it
is a frequent triple. No other triple of words is even a candidate for being a
frequent triple, since for no other triple of words are its three doubleton
subsets frequent. As there is only one frequent triple, there can be no frequent
quadruples or larger sets. [4 Marks]
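The singleton and doubleton counts above can be verified with a short Python sketch (words lowercased, and duplicate words within a basket counted once):

from collections import Counter
from itertools import combinations

baskets = [
    {"cat", "and", "dog", "bites"},
    {"yahoo", "news", "claims", "a", "cat", "mated", "with", "dog", "and",
     "produced", "viable", "offspring"},
    {"cat", "killer", "likely", "is", "a", "big", "dog"},
    {"professional", "free", "advice", "on", "dog", "training", "puppy"},
    {"cat", "and", "kitten", "training", "behavior"},
    {"dog", "&", "cat", "provides", "training", "in", "eugene", "oregon"},
    {"dog", "and", "cat", "is", "a", "slang", "term", "used", "by", "police",
     "officers", "for", "male-female", "relationship"},
    {"shop", "for", "your", "show", "dog", "grooming", "and", "pet", "supplies"},
]
s = 3
singletons = Counter(w for b in baskets for w in b)
freq_items = {w for w, c in singletons.items() if c >= s}
print(sorted(freq_items))  # ['a', 'and', 'cat', 'dog', 'training']

pair_counts = Counter(p for b in baskets
                      for p in combinations(sorted(b & freq_items), 2))
print([p for p, c in pair_counts.items() if c >= s])  # the five frequent doubletons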
ii) Enumerate and divide the following bit stream into buckets using the AP CO2 8
Datar-Gionis-Indyk-Motwani (DGIM) algorithm:
. . 1 0 1 1 0 1 1 0 0 0 1 0 1 1 1 0 1 1 0 0 1 0 1 1 0
Ans:
This is the simplest case of the algorithm called DGIM. This version of the
algorithm uses O(log² N) bits to represent a window of N bits, and allows us to
estimate the number of 1's in the window with an error of no more than 50%.
[4 Marks]

(Figure: the bit stream partitioned into DGIM buckets.) [4 Marks]
6 i) Consider a generalization of the problem of counting distinct elements in a AP CO2 16
stream: the problem called computing "moments." Find and formulate the probability
s/n for the second moment using the Alon-Matias-Szegedy algorithm.
Ans:
We estimate the second moment of the stream using a limited amount of space; the
more space we use, the more accurate the estimate will be. We compute some number
of variables. For each variable X, we store:
1. A particular element of the universal set, which we refer to as X.element, and
2. An integer X.value, which is the value of the variable.
To determine the value of a variable X, we choose a position in the stream between
1 and n, uniformly and at random. Set X.element to be the element found there, and
initialize X.value to 1. As we read the stream, add 1 to X.value each time we
encounter another occurrence of X.element.
When we reach position 3, we find element c, so we set X1.element = c and
X1.value = 1. Position 4 holds b, so we do not change X1. Likewise, nothing
happens at positions 5 or 6. At position 7, we see c again, so we set
X1.value = 2. At position 8 we find d, and so set X2.element = d and X2.value = 1.
Positions 9 and 10 hold a and b, so they do not affect X1 or X2. Position 11 holds
d so we set X2.value = 2, and position 12 holds c so we set X1.value = 3. At
position 13, we find element a, and so set X3.element = a and X3.value = 1. Then,
at position 14 we see another a and so set X3.value = 2. Position 15, with element
b, does not affect any of the variables, so we are done, with final values
X1.value = 3 and X2.value = X3.value = 2. [8 Marks]
We can derive an estimate of the second moment from any variable X. This estimate
is n(2X.value - 1). [8 Marks]
