You are on page 1of 47

The Flajolet-Martin Algorithm

The Flajolet-Martin Algorithm


The Flajolet-Martin Algorithm
The Flajolet-Martin Algorithm
The Flajolet-Martin Algorithm
● Calculate Hash function, h (x)
The Flajolet-Martin Algorithm
The Flajolet-Martin Algorithm
● Calculate binary bit for every calculated Hash function
The Flajolet-Martin Algorithm
● Find out the trailing zeros
The Flajolet-Martin Algorithm
● Find out the trailing zeros
The Flajolet-Martin Algorithm
● Find out the distinct element
The Alon-Matias-Szegedy Algorithm for Second
Moments
● We compute some number of variables. For each variable X, we
store:

○ 1. A particular element of the universal set, which we refer to as X.element ,and

○ 2. An integer X.value, which is the value of the variable. To determine the value of a
variable X, we choose a position in the stream between 1 and n, uniformly and at
random. Set X.element to be the element found there, and initialize X.value to 1. As
we read the stream, add 1 to X.value each time we encounter another occurrence
of X.element .
The Alon-Matias-Szegedy Algorithm for Second
Moments
● Suppose the stream is a, b, c, b, d, a, c, d, a, b, d, c,
a, a, b.
● The length of the stream is n = 15.
a : 5 times,
b : 4 times
c : 3 times
d : 3 times
● the second moment for the stream is
52 + 42 + 32 + 32 = 59
Exercise: Calculate the
surprise number using Alon-
Matias-Szegedy Algorithm
● 5,5,5,5,5
● 9,9,1,1,5
● 10,9,9,9,9,9,9,9,9,9,9
Answer: Calculate the
surprise number using Alon-
Matias-Szegedy Algorithm
● 5,5,5,5,5 = 52 = 25
● 9,9,1,1,5 =22 + 22 + 12 = 9
● 10,9,9,9,9,9,9,9,9,9,9 =12 + 102 =101
The Datar-Gionis-Indyk-Motwani
Algorithm
The Datar-Gionis-Indyk-Motwani
Algorithm
The Datar-Gionis-Indyk-Motwani
Algorithm
The Datar-Gionis-Indyk-Motwani
Algorithm
Maintaining the DGIM Conditions
Maintaining the DGIM Conditions
AACS1573
Introduction to
Data Science
Chapter 5
Mining Data Stream
(Part B)
We will learn about…

6.1 6.2
PageRank Computation
of PageRank
6.1
PageRank
next…
Crawling
Crawling
Crawling
Early Search Engines and Term Spam

● Largely, they worked by crawling the Web and


listing the terms (words or other strings of
characters other than white space) found in each
page, in an inverted index.
● Inverted index = data structure that makes it easy,
given a term, to find (pointers to) all the places
where that term occurs.
Early Search Engines and Term Spam

● Unethical
people saw the
opportunity to
fool search
engines into
leading people
to their page.
Early Search Engines and Term Spam
● Thus, if you were selling shirts on the Web, all you
cared about was that people would see your page,
regardless of what they were looking for.
● Thus, you could add a term like “movie” to your
page, and do it thousands of times, so a search
engine would think you were a terribly important
page about movies.
● When a user issued a search query
with the term “movie,” the search
engine would list your page first.
Early Search Engines and Term Spam
● Techniques for fooling search engines into believing your page is
about SOMETHING IT IS NOT = term spam!
● To combat term spam, Google introduced two innovations.

Solution 1: PageRank was used to simulate where Web surfers, starting at a random
page, would tend to congregate if they followed randomly chosen outlinks from the
page at which they were currently located, and this process were allowed to iterate
many times. Pages that would have a large number of surfers were considered more
“important” than pages that would rarely be visited. Google prefers important pages
to unimportant pages when deciding which pages to show first in response to a
search query.
Early Search Engines and Term Spam

○ Solution 2: The content of a page was judged not only by the terms appearing on
that page, but by the terms used in or near the links to that page. Note that while it
is easy for a spammer to add false terms to a page they control, they cannot as
easily get false terms added to the pages that link to their own page, if they do not
control those pages.
Important pages
Important pages
Important pages
Definition of PageRank
● PageRank = function that assigns a real number to each page in the
Web
● The intent is that the higher the PageRank of a page, the more
“important” it is.
● Think of the Web as a directed graph, where pages are the nodes,
and there is an arc from page p1 to page p2 if there are one or more
links from p1 to p2.
Definition of PageRank
● Page A has links to each of the other three pages; page B has links
to A and D only; page C has a link only to A, and page D has links to
B and C only.
Definition of PageRank
● Suppose a random surfer starts at page A in Fig. 5.1.
● There are links to B, C, and D, so this surfer will next be at each of those
pages with probability 1/3, and has zero probability of being at A.
● A random surfer at B has, at the next step, probability 1/2 of being at A, 1/2 of
being at D, and 0 of being at B or C.

● Transition matrix of the Web = what happens


to random surfers after one step.
● This matrix M has n rows and columns, if there
are n pages.
● The element mij in row i and column j has
value 1/k if page j has k arcs out, and one of
them is to page i. Otherwise, mij = 0.
Definition of PageRank
● The transition matrix for the Web of Fig. 5.1 is

● In this matrix, the order of the pages is the natural one, A, B, C, and D. Thus, the
first column expresses the fact, already discussed, that a surfer at A has a 1/3
probability of next being at each of the other pages. The second column expresses
the fact that a surfer at B has a 1/2 probability of being next at A and the same of
being at D. The third column says a surfer at C is certain to be at A next. The last
column says a surfer at D has a 1/2 probability of being next at B and the same at C.
6.2
Computation
of PageRank
next…
Computation of PageRank

● The matrix for the graph of figure above is


Computation of PageRank
● The rows and columns correspond to A, B, and D
in that order.
● To get the PageRanks for this matrix, we start
with a vector with all components equal to 1/3, and
repeatedly multiply by M.
● The sequence of vectors we get is

● We now know that the PageRank of A is 2/9, the


PageRank of B is 4/9, and the PageRank of D is 3/9.
We have learned about…

6.1 6.2
PageRank Computation
of PageRank
Coming next…
Chapter 6 Case Study (Part B)

You might also like