○ 2. An integer X.value, which is the value of the variable. To determine the value of a variable X, we choose a position in the stream between 1 and n, uniformly and at random. Set X.element to be the element found there, and initialize X.value to 1. As we read the stream, add 1 to X.value each time we encounter another occurrence of X.element.
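The variable just described can be sketched in Python. A minimal sketch, assuming the standard AMS estimate n·(2·X.value − 1) of the second moment; the expectation of that estimate over a uniformly random position equals the second moment, so averaging over every possible position recovers it exactly.

```python
import random

def ams_variable(stream, pos):
    """Build one AMS variable X from a chosen position pos (0-based)."""
    element = stream[pos]          # X.element: the element found there
    value = 1                      # initialize X.value to 1
    for x in stream[pos + 1:]:     # read the rest of the stream
        if x == element:
            value += 1             # another occurrence of X.element
    return element, value

def ams_estimate(stream, pos):
    """Estimate of the second moment from one variable: n * (2*X.value - 1)."""
    n = len(stream)
    _, value = ams_variable(stream, pos)
    return n * (2 * value - 1)

stream = list("abcbdacdabdcaab")   # the 15-element example stream, n = 15

# Averaging the estimate over all positions gives the second moment exactly:
estimates = [ams_estimate(stream, p) for p in range(len(stream))]
print(sum(estimates) / len(estimates))   # 59.0

# In practice one variable uses a single randomly chosen position:
print(ams_estimate(stream, random.randrange(len(stream))))
```

Each variable is cheap to maintain (one element and one counter), so many independent variables can be averaged to reduce the variance of the estimate.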
The Alon-Matias-Szegedy Algorithm for Second Moments
● Suppose the stream is a, b, c, b, d, a, c, d, a, b, d, c, a, a, b.
● The length of the stream is n = 15.
● The counts are a: 5 times, b: 4 times, c: 3 times, d: 3 times.
● The second moment for the stream is 5² + 4² + 3² + 3² = 59.
Exercise: Calculate the surprise number using the Alon-Matias-Szegedy Algorithm
● 5,5,5,5,5
● 9,9,1,1,5
● 10,9,9,9,9,9,9,9,9,9,9
Answer: Calculate the surprise number using the Alon-Matias-Szegedy Algorithm
● 5,5,5,5,5: 5² = 25
● 9,9,1,1,5: 2² + 2² + 1² = 9
● 10,9,9,9,9,9,9,9,9,9,9: 1² + 10² = 101
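These answers can be checked directly by counting each element and summing the squares of the counts, for example:

```python
from collections import Counter

def surprise_number(stream):
    """The second moment: the sum of the squares of the element counts."""
    return sum(count ** 2 for count in Counter(stream).values())

print(surprise_number([5, 5, 5, 5, 5]))                     # 25
print(surprise_number([9, 9, 1, 1, 5]))                     # 9
print(surprise_number([10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9]))  # 101
```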
The Datar-Gionis-Indyk-Motwani Algorithm
Maintaining the DGIM Conditions
AACS1573 Introduction to Data Science
Chapter 5: Mining Data Stream (Part B)
We will learn about…
6.1 PageRank
6.2 Computation of PageRank
6.1 PageRank
Crawling
Early Search Engines and Term Spam
● Unethical people saw the opportunity to fool search engines into leading people to their page.
● If you were selling shirts on the Web, all you cared about was that people would see your page, regardless of what they were looking for.
● Thus, you could add a term like “movie” to your page, thousands of times, so a search engine would think you were a terribly important page about movies.
● When a user issued a search query with the term “movie,” the search engine would list your page first.
● Techniques for fooling search engines into believing your page is about something it is not are called term spam!
● To combat term spam, Google introduced two innovations.
○ Solution 1: PageRank was used to simulate where Web surfers, starting at a random page, would tend to congregate if they followed randomly chosen outlinks from the page at which they were currently located, and this process were allowed to iterate many times. Pages that would have a large number of surfers were considered more “important” than pages that would rarely be visited. Google prefers important pages to unimportant pages when deciding which pages to show first in response to a search query.
○ Solution 2: The content of a page was judged not only by the terms appearing on that page, but by the terms used in or near the links to that page. Note that while it is easy for a spammer to add false terms to a page they control, they cannot as easily get false terms added to the pages that link to their own page, if they do not control those pages.
Important pages
Definition of PageRank
● PageRank = a function that assigns a real number to each page in the Web.
● The intent is that the higher the PageRank of a page, the more “important” it is.
● Think of the Web as a directed graph, where pages are the nodes, and there is an arc from page p1 to page p2 if there are one or more links from p1 to p2.
● In the example graph of Fig. 5.1, page A has links to each of the other three pages; page B has links to A and D only; page C has a link only to A; and page D has links to B and C only.
● Suppose a random surfer starts at page A in Fig. 5.1.
● There are links to B, C, and D, so this surfer will next be at each of those pages with probability 1/3, and has zero probability of being at A.
● A random surfer at B has, at the next step, probability 1/2 of being at A, 1/2 of being at D, and 0 of being at B or C.
● In this matrix, the order of the pages is the natural one: A, B, C, and D. Thus, the first column expresses the fact, already discussed, that a surfer at A has a 1/3 probability of next being at each of the other pages. The second column expresses the fact that a surfer at B has a 1/2 probability of being next at A and the same of being at D. The third column says a surfer at C is certain to be at A next. The last column says a surfer at D has a 1/2 probability of being next at B and the same at C.
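The transition matrix just described, and the repeated surfing process of Solution 1, can be sketched with NumPy (an assumption: the source computes this by hand). Repeatedly multiplying the surfer's probability distribution by the matrix is power iteration; for this graph it converges to PageRanks of 3/9 for A and 2/9 each for B, C, and D.

```python
import numpy as np

# Transition matrix M for Fig. 5.1; row/column order is A, B, C, D.
# Entry M[i, j] is the probability that a surfer at page j moves next to page i.
M = np.array([
    [0,     1 / 2, 1, 0    ],  # to A: from B with prob 1/2, from C with prob 1
    [1 / 3, 0,     0, 1 / 2],  # to B: from A with prob 1/3, from D with prob 1/2
    [1 / 3, 0,     0, 1 / 2],  # to C: from A with prob 1/3, from D with prob 1/2
    [1 / 3, 1 / 2, 0, 0    ],  # to D: from A with prob 1/3, from B with prob 1/2
])

# Start the surfer at a page chosen uniformly at random, then iterate.
v = np.full(4, 1 / 4)
for _ in range(75):
    v = M @ v

print(np.round(v, 4))  # [0.3333 0.2222 0.2222 0.2222]
```

Note that each column of M sums to 1: a surfer at any page must go somewhere. That property is what makes the iteration preserve a probability distribution.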
6.2 Computation of PageRank
Computation of PageRank
Coming next…
Chapter 6 Case Study (Part B)