
American Scientist
the magazine of Sigma Xi, The Scientific Research Society

This reprint is provided for personal and noncommercial use. For any other use, please send a request to Brian Hayes by electronic mail to bhayes@amsci.org.

274 American Scientist, Volume 96

Computing Science

© 2008 Brian Hayes. Reproduction with permission only. Contact bhayes@amsci.org.

The Britney Spears Problem

Brian Hayes

Back in 1999, the operators of the Lycos Internet portal began publishing a weekly list of the 50 most popular queries submitted to their Web search engine. Britney Spears—initially tagged a “teen songstress,” later a “pop tart”—was No. 2 on that first weekly tabulation. She has never fallen off the list since then—440 consecutive appearances when I last checked. Other perennials include Pamela Anderson and Paris Hilton. What explains the enduring popularity of these celebrities, so famous for being famous? That’s a fascinating question, and the answer would doubtless tell us something deep about modern culture. But it’s not the question I’m going to take up here. What I’m trying to understand is how we can know Britney’s ranking from week to week. How are all those queries counted and categorized? What algorithm tallies them up to see which terms are the most frequent?

One challenging aspect of this task is simply coping with the volume of data. Lycos reports processing 12 million queries a day, and other search engines, such as Google, handle orders of magnitude more. But that’s only part of the problem. After all, if you have the computational infrastructure to answer all those questions about Britney and Pamela and Paris, then it doesn’t seem like much of an added burden to update a counter each time some fan submits a request. What makes the counting difficult is that you can’t just pay attention to a few popular subjects, because you can’t know in advance which ones are going to rank near the top. To be certain of catching every new trend as it unfolds, you have to monitor all the incoming queries—and their variety is unbounded.

In the past few years the tracking of hot topics has itself become a hot topic in computer science. Algorithms for such tasks have a distinctive feature: They operate on a continuous and unending stream of data, rather than waiting for a complete batch of information to be assembled. Like a worker tending a conveyor belt, the algorithm has to process each element of the stream in sequence, as soon as it arrives. Ideally, all computations on one element are finished before the next item comes along.

Much of the new interest in stream algorithms is inspired by the Internet, where streams of many kinds flow copiously. It’s not just a matter of search-engine popularity contests. A similar algorithm can help a network manager monitor traffic patterns, revealing which sites are generating most of the volume. The routers and switches that actually direct the traffic also rely on stream algorithms, passing along each packet of data before turning to the next. A little farther afield, services that filter spam from e-mail can use stream algorithms to detect messages sent in thousands or millions of identical copies.

Apart from the Internet, stream algorithms are also being applied to flows of financial data, such as stock-market transactions and credit-card purchases. If some government agency wanted to monitor large numbers of telephone calls, they too might have an interest in stream algorithms. Finally, the designers of software for some large-scale scientific experiments adopt a stream-oriented style of data processing. Detectors at particle-physics laboratories produce so much data that no machine can store it all, even temporarily, and so preliminary filtering is done by programs that analyze signals on the fly.

Stream Gauges

Search-engine queries and Internet packets are fairly complicated data structures, but the principles of stream algorithms can be illustrated with simple streams of numbers. Suppose a pipeline delivers an endless sequence of nonnegative integers at a steady rate of one number every t time units. We want to build a device—call it a stream gauge—that intercepts the stream and displays answers to certain questions about the numbers.

What computational resources does a stream gauge need to accomplish its task? The resources of particular concern are processing time—which can’t exceed t units per stream element—and auxiliary storage space. (It’s convenient to assume that numbers of any size can be stored in one unit of space, and that any operation of ordinary arithmetic, such as addition or multiplication, can be performed in one unit of time. Comparing two numbers also takes a single time unit.)

For some questions about number streams, it’s quite easy to design an effective stream gauge. If you want to know how many stream elements have been received, all you need is a counter. The counter is a storage location, or register, whose initial value is set to 0. Each time an element of the stream arrives, the counter is incremented: In other words, if the current value of the counter is x, it becomes x+1.

Brian Hayes is senior writer for American Scientist. A collection of his essays, Group Theory in the Bedroom, and Other Mathematical Diversions, was published in April by Hill and Wang. Additional material related to the “Computing Science” column appears in Hayes’s Weblog at http://bit-player.org. Address: 211 Dacian Avenue, Durham, NC 27701. Internet: brian@bit-player.org

Tracking who’s hot and who’s not presents an algorithmic challenge.


Instead of counting the stream elements, another kind of gauge displays the sum of all the integers received. Again we need a register x initialized to 0; then, on the arrival of each integer n, we add n to the running total in x.

Still another easy-to-build gauge shows the maximum value of all the integers seen so far. Yet again x begins at 0; whenever a stream element n is greater than x, that n becomes the new x.

For each of these devices, the storage requirement is just a single register, and the processing of each stream element is completed in a single unit of time. Computing doesn’t get much easier than that. Certain other stream functions require only slightly greater effort. For example, calculating the average (or arithmetic mean) of a stream’s integers takes three registers and three operations per integer. One register counts the elements, another records their sum, and the third register holds the value of the average, calculated as the quotient of the sum and the count.
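All four gauges can be bundled into a few lines of Python. The class below is my own illustration, not code from the article; the point is that every update touches only a fixed set of registers, regardless of how long the stream runs.

```python
class StreamGauge:
    """Constant-space gauges: count, sum, maximum and mean of a stream."""

    def __init__(self):
        self.count = 0      # number of elements seen
        self.total = 0      # running sum
        self.maximum = 0    # largest element so far (stream is nonnegative)
        self.mean = 0.0     # quotient of sum and count

    def update(self, n):
        # One element arrives; every register is refreshed in O(1) time.
        self.count += 1
        self.total += n
        if n > self.maximum:
            self.maximum = n
        self.mean = self.total / self.count

gauge = StreamGauge()
for n in [6, 9, 2, 8, 8, 1, 5, 1, 0, 4]:
    gauge.update(n)
print(gauge.count, gauge.total, gauge.maximum, gauge.mean)  # → 10 44 9 4.4
```

No matter how many integers flow past, the gauge occupies the same four registers—the constant-space property discussed next.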

What matters most about the frugal memory consumption of these algorithms is not the exact number of registers needed; what matters is that the storage space remains constant no matter how long the stream. In effect, the algorithms can do calculations of any length on a single sheet of scratch paper. Their time performance follows a similar rule: They use the same number of clock cycles for each element of the stream. The algorithms are said to run in constant space and constant time.

Logjam

Not every property of a stream can be computed in constant space and time. Consider a device that reports the number of distinct elements in a stream of integers. (For example, although the stream 3, 1, 8, 3, 1, 4, 3 has a total of seven elements, it has only four distinct elements: 3, 1, 8 and 4.)

There’s a straightforward method for counting distinct elements: Start a counter at 0 and increment the counter whenever the stream delivers a number you haven’t seen before. But how do you know that a given number is making its first appearance? The obvious answer is to keep a list of all the integers you’ve encountered so far, and compare each incoming value n with the numbers in the stored set. If n is already present in the list, ignore the new copy. If n is not found, append it to the stored set and increment the counter.

The trouble is, this is not a constant-space algorithm. The set of stored values can grow without limit. In the worst case—where every element of the stream is unique—the size of the stored set is equal to the length of the stream. No fixed amount of memory can satisfy this appetite.
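The straightforward method looks like this in Python (my sketch, not the article’s code). The `seen` set is the unbounded store: its size equals the number of distinct elements, which in the worst case equals the length of the stream.

```python
def count_distinct(stream):
    """Exact distinct count; auxiliary storage grows with the stream."""
    seen = set()   # can grow as large as the stream itself
    count = 0
    for n in stream:
        if n not in seen:   # first appearance of this value?
            seen.add(n)
            count += 1
    return count

print(count_distinct([3, 1, 8, 3, 1, 4, 3]))  # the stream from the text → 4
```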

The problem of counting distinct elements is closely related to the problem of identifying the most-frequent elements in a stream. Having a solution to one problem gives us a head start on the other. The distinct-elements algorithm needs to keep a list indicating which stream elements have been observed so far. The frequent-elements algorithm needs the same list, but instead of simply noting whether or not an element is present, the program must maintain a counter for each element, giving its frequency. Since the list of elements and their counters can grow without limit, demands for space are again potentially unbounded.

Both of the algorithms could run out of time as well as space, depending on details of implementation. If it’s necessary to search through all the stored numbers one by one for every newly received stream element, then the time needed grows in proportion to the size of the stored set. Eventually, the delay must exceed t, the interval between stream elements, at which point the algorithm can no longer keep up with the incoming data. Other data structures, such as trees and indexed arrays, allow for quicker access, although other time constraints may remain.

[Figure: Lycos 50 rank trajectories (logarithmic scale, ranks 1–50) for Britney Spears, Clay Aiken, Jessica Simpson, Paris Hilton, poker, Pamela Anderson and WWE over weeks 360–440.] The Lycos 50 rankings of popular Web search terms have their ups and downs, but they also reveal an extraordinary persistence of interest in a small constellation of starlets and other pop-culture figures, as well as entertainments such as poker and professional wrestling (WWE). The graph records the trajectories of a few subjects that have made the list almost every week throughout the eight-year history of the rankings. The time span shown is an 87-week period from September 2006 to May 2008; week numbers count from the first survey in 1999; the two gray bars mark holiday weeks when no rankings were published. The logarithmic scale is meant to more clearly delineate the highest-ranking categories. (Psychologically, the difference between first and second is greater than the difference between 31st and 32nd.) Data are from http://50.lycos.com/

Majority Rules

Stream algorithms that require more than a constant amount of storage space are seldom of practical use in large-scale applications. Unfortunately, for tasks such as counting distinct elements and finding most-frequent elements, there is really no hope of creating a constant-space algorithm that’s guaranteed always to give the correct answer. But before we grow too despondent about this bleak outlook, I should point out that there are also a few pleasant surprises in the world of stream algorithms.

Although identifying the most frequent item in a stream is hard in general, there is an ingenious way of doing it in one special case—namely, when the most common item is so popular that it accounts for a majority of the stream entries (more than half the elements). The algorithm that accomplishes this task requires just two registers, and it runs in a constant amount of time per stream element. (Before reading on you might want to try constructing such an algorithm for yourself.)

The majority-finding algorithm uses one of its registers for temporary storage of a single item from the stream; this item is the current candidate for majority element. The second register is a counter initialized to 0. For each element of the stream, we ask the algorithm to perform the following routine. If the counter reads 0, install the current stream element as the new majority candidate (displacing any other element that might already be in the register). Then, if the current element matches the majority candidate, increment the counter; otherwise, decrement the counter. At this point in the cycle, if the part of the stream seen so far has a majority element, that element is in the candidate register, and the counter holds a value greater than 0. What if there is no majority element? Without making a second pass through the data—which isn’t possible in a stream environment—the algorithm cannot always give an unambiguous answer in this circumstance. It merely promises to correctly identify the majority element if there is one.
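The routine translates almost line for line into code. Here is a Python sketch (my own; the function name is illustrative). It returns the surviving candidate after the whole stream has passed, an answer that is trustworthy only when a true majority element exists.

```python
def majority_candidate(stream):
    """Two-register majority finder: one candidate slot, one counter."""
    candidate = None
    counter = 0
    for element in stream:
        if counter == 0:
            candidate = element   # install a new majority candidate
        if element == candidate:
            counter += 1          # a vote for the candidate
        else:
            counter -= 1          # a vote against it
    return candidate

# 4 occurs five times in nine elements, a strict majority:
print(majority_candidate([4, 9, 4, 6, 4, 4, 7, 4, 6]))  # → 4
```

If no element has a majority, the returned candidate may be arbitrary—exactly the weak guarantee described above.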

The majority algorithm was invented in 1982 by Michael J. Fischer and Steven L. Salzberg of Yale University. The version I have described here comes from a 2002 article by Erik D. Demaine of MIT and Alejandro López-Ortiz and J. Ian Munro of the University of Waterloo in Canada. Demaine and his colleagues have extended the algorithm to cover a more-general problem: Given a stream of length n, identify a set of size m that includes all the elements occurring with a frequency greater than n/(m+1). (In the case of m=1, this reduces to the majority problem.) The extended algorithm requires m registers for the candidate elements as well as m counters. The basic scheme of operation is analogous to that of the majority algorithm. When a stream element matches one of the candidates, the corresponding counter is incremented; when there is no match to any candidate, all of the counters are decremented; if a counter is at 0, the associated candidate is replaced by a new element from the stream.

Again the results carry only a weak guarantee: If any elements of the stream exceed the threshold frequency, those elements will appear among the candidates, but not all the candidates are necessarily above the threshold. Even with this drawback, the algorithm performs impressive feats, such as scanning a stream of Web search queries for all terms that make up at least 1 percent of the traffic.
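The extended scheme is essentially what is now often called the Misra–Gries algorithm. The Python sketch below is my illustration, in a common variant: a new element is installed with count 1 whenever a register is free, rather than replacing a zeroed candidate on the spot. With m counters it returns a superset of all elements occurring more than n/(m+1) times.

```python
def frequent_candidates(stream, m):
    """Misra–Gries style: at most m candidate/counter pairs, constant space."""
    counts = {}  # candidate -> counter, never more than m entries
    for element in stream:
        if element in counts:
            counts[element] += 1          # match: increment its counter
        elif len(counts) < m:
            counts[element] = 1           # a free register: install candidate
        else:
            # no match, no free register: decrement every counter,
            # evicting any candidate whose counter reaches 0
            for c in list(counts):
                counts[c] -= 1
                if counts[c] == 0:
                    del counts[c]
    return set(counts)

# 1 occurs 5 times in 8 elements, above the n/(m+1) = 8/3 threshold:
print(sorted(frequent_candidates([1, 1, 1, 2, 3, 1, 2, 1], 2)))  # → [1, 2]
```

As the text warns, the output is a set of candidates: every above-threshold element is in it, but some candidates (here 2) may be below threshold.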

Getting It Almost Right

For many stream problems of practical interest, computing exact answers is simply not feasible—but it’s also not necessary. A good estimate, or an answer that’s probably correct, will serve just fine.

[Figure: A stream of integers flowing past an averaging device, shown at four stages; count, sum and mean registers are updated as each element passes.] A stream algorithm must process its input on the fly, in sequence, one element at a time. Even so, some computations on streams are easy to implement. The example shown here is the computation of the average, or arithmetic mean. The stream elements are integers emitted from the source at the left; they flow to the right and are read as they pass by the averaging device; thereafter they cannot be examined again. The algorithm counts the elements and sums them in two registers of auxiliary storage; the output is the quotient of these quantities, computed in a third register. At all stages of the computation the output is the mean of all the integers in the segment of the stream seen so far. Streams of any length can be handled with only a fixed amount of memory.

One simple approach to approximation is to break a stream into blocks, turning a continuous process into a series of batch operations. Within a block, you are not limited to algorithms that obey the one-pass-only rule. On the other hand, answers derived from a sequence of blocks only approximate answers for the stream as a whole. For example, in a count of distinct elements, an item counted in two successive blocks would be counted only once if the blocks were combined.

A variant of the block idea is the sliding window. A window of length k holds the most recent k elements from the stream, with each new arrival displacing the oldest item in the queue.
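A sliding window is naturally modeled with a double-ended queue. This sketch is my own example—the article ties the window to no particular statistic—showing a windowed running mean in constant space for fixed k.

```python
from collections import deque

def window_means(stream, k):
    """Yield the mean over a sliding window of the k most recent elements."""
    window = deque(maxlen=k)    # a new arrival displaces the oldest item
    total = 0
    for n in stream:
        if len(window) == k:
            total -= window[0]  # subtract the element about to be displaced
        window.append(n)
        total += n
        yield total / len(window)

print(list(window_means([1, 2, 3, 4], 2)))  # → [1.0, 1.5, 2.5, 3.5]
```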

A great deal of ingenuity has been applied to the search for better approximate stream algorithms, and there are successes to report. Here I shall briefly mention work in two areas, based on sampling and on hashing.

When the items that interest you are the most frequent ones in a stream, statistics is on your side; any sample drawn from the stream is most likely to include those very items. Indeed, any random sample represents an approximate solution to the most-common-elements problem. But the simplest sampling strategy is not necessarily the most efficient or the most accurate. In 2002 Gurmeet Singh Manku and Rajeev Motwani of Stanford University described two methods they called sticky sampling and lossy counting.

Suppose you want to select a representative sample of 100 items from a stream. If the stream consists of just 100 elements, the task is easy: Take all the elements, or in other words select them with probability 1. When the stream extends to 200 elements, you can make a new random selection with probability 1/2; at 400 elements, the correct probability is 1/4, and so on. Manku and Motwani propose a scheme for continually readjusting the selection probability without having to start over with a new sample. Their sticky-sampling method refines the selection each time the stream doubles in length, maintaining counters that estimate how often each item appears in the full stream. The algorithm solves the most-frequent-elements problem in constant space, but only in a probabilistic sense. If you run the algorithm twice on the same data, the results will likely differ.

[Figure: Two devices reading the same stream: a distinct-elements counter that appends each new element to an ever-growing store, and a most-frequent-items tracker that keeps a list of elements with their counts.] Unbounded memory requirements make some stream algorithms intractable. Counting the number of distinct elements in a stream (above) seems easy at first: Just increment a counter on the first appearance of each element. But knowing whether or not an element is new requires keeping a record of those already seen. Each new element is added to the front of a list; the output is the length of the list, which can grow without limit. An algorithm for identifying the most frequent items in a stream (below) is similar but includes a counter for each item. A further complication is that the output is not a single number but an entire data structure. In the version shown here the lists of items and counters are kept sorted with the most frequent items first.

Lossy counting is based on a similar idea of continually refining a sample as the stream lengthens, but in this case the algorithm is deterministic, making no use of random choices. Each stream element is checked against a stored list; if the element is already present, a counter is incremented; otherwise a new entry is added. To keep the list from growing uncontrollably, it is periodically purged of infrequent elements. Lossy counting is not guaranteed to run in constant space, but Manku and Motwani report that in practice it performs better than sticky sampling.
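The check-increment-purge cycle can be sketched as follows. This is a simplified rendition of lossy counting, my own illustration: the published algorithm also records a per-entry error term and uses it in the purge rule, which I omit here. The stream is cut into buckets of width 1/ε; at each bucket boundary every counter is decremented and zeroed entries are dropped, so any element with frequency well above εn survives.

```python
import math

def lossy_counts(stream, epsilon=0.1):
    """Simplified lossy counting: purge infrequent items at each bucket end."""
    width = math.ceil(1 / epsilon)   # bucket width
    counts = {}
    for i, element in enumerate(stream, start=1):
        counts[element] = counts.get(element, 0) + 1
        if i % width == 0:
            # end of a bucket: decrement every counter, drop the zeros
            for e in list(counts):
                counts[e] -= 1
                if counts[e] == 0:
                    del counts[e]
    return counts

# 7 makes up half the stream; the one-off elements are purged away:
print(lossy_counts([7, 1, 7, 2, 7, 3, 7, 4, 7, 5, 7, 6], epsilon=0.25))  # → {7: 3}
```

Note that the surviving counts are undercounts (each purge subtracts 1), which is why the real algorithm tracks a maximum-error term per entry.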

Making A Hash of It

Hashing—one of the more aptly named techniques of computer science—is a way of scrambling data so that the result looks random. Oddly enough, this turns out to be a good way of making sense of the data.

A completely random stream might seem hard to analyze, but in fact randomness opens the door to some simple but powerful statistical methods. Imagine a stream composed of numbers chosen at random from some fixed range of integers, say 1 through n. For this stream the number of distinct elements observed follows a highly predictable course; in particular, all n distinct elements should have appeared at least once by the time the stream reaches length n log n.

This kind of analysis won’t work for most streams of real data because the elements are not drawn from a convenient fixed range, and they are not random—unless you compose search-engine queries by blindly pecking at the keyboard. This is where hashing comes in. A hash function transforms a data item into a more-or-less random value uniformly distributed over some interval. Think of hashing as the action of a mad postal clerk sorting letters into bins. The clerk’s rule says that if two letters have the same address, they go into the same bin, but in all other respects the sorting might as well be random.

By randomly distributing all the elements of a stream into a fixed number of bins, hashing allows for easy statistical analysis of the stream’s elements. For example, the number of distinct elements can be estimated from the probability that a given bin remains empty, or from the minimum number of elements found in any bin. A recent paper by Ahmed Metwally and Divyakant Agrawal of Ask.com and Amr El Abbadi of the University of California, Santa Barbara, evaluates a dozen of these algorithms.
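The empty-bin estimate (sometimes called linear counting) is easy to sketch. Hash each element into one of m bins; if, after the stream has passed, the empty fraction of bins is V, the number of distinct elements is roughly m·ln(1/V). The hash function and parameters below are my own choices for illustration, not from the paper.

```python
import hashlib
import math

def estimate_distinct(stream, m=1024):
    """Estimate the number of distinct elements from the empty-bin count."""
    occupied = [False] * m
    for element in stream:
        # a cryptographic hash stands in for the "mad postal clerk"
        h = hashlib.sha1(str(element).encode()).digest()
        occupied[int.from_bytes(h[:8], "big") % m] = True
    empty = occupied.count(False)
    if empty == 0:
        return float("inf")   # too many distinct items for m bins
    return m * math.log(m / empty)

# 500 distinct values, each repeated three times; duplicates land in
# the same bin and so have no effect on the estimate:
stream = list(range(500)) * 3
print(round(estimate_distinct(stream)))
```

The memory cost is m bits, fixed in advance, no matter how long the stream or how large its elements.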

Gently Down the Stream

Are there better algorithms for stream problems waiting to be discovered? Some of the limits of what’s possible were outlined in 1996 by Noga Alon, Yossi Matias and Mario Szegedy, who were all then at AT&T Bell Laboratories. They framed the issue in terms of an infinite series of frequency moments, analogous to the more familiar statistical moments (mean, variance, and so on). The zero-order frequency moment F₀ is the number of distinct elements in the stream; the first-order moment F₁ is the total number of elements; F₂ is a quantity called the repeat rate; still higher moments describe the “skew” of the stream, giving greater emphasis to the most-frequent elements. All of the frequency moments are defined as sums of powers of mᵢ, where mᵢ is the number of times that item i appears in a stream, and the sum is taken over all possible values of i.

It’s interesting that calculating F₁ is so easy. We can determine the exact length of a stream with a simple, deterministic algorithm that runs in constant space. Alon, Matias and Szegedy proved that no such algorithm exists for any other frequency moment. We can get an exact value of F₀ (the number of distinct elements) only by supplying much more memory. Even approximating F₀ in constant space is harder: It requires a nondeterministic algorithm, one that makes essential use of randomness. The same is true of F₂. For the higher moments, F₆ and above, there are no constant-space methods at all.

[Figure: Worked example of the two-register majority algorithm on a short stream, showing the count and candidate registers after each element.] An algorithm that uses just two memory registers can identify the most common element of a stream in the special case where that value makes up more than half the elements. The rules state that if the counter reads 0, the current element should be installed in the candidate register. Then, if the current element matches the candidate, increment the counter; otherwise decrement it. In the example worked out here, the device correctly shows that 6 is a majority item for the first three elements of the stream, and 4 is a majority of the first seven elements.

All this mathematics and algorithmic engineering seems like a lot of work just for following the exploits of a famous “pop tart.” But I like to think the effort might be justified. Years from now, someone will type “Britney Spears” into a search engine and will stumble upon this article listed among the results. Perhaps then a curious reader will be led into new lines of inquiry.

Bibliography

Alon, Noga, Yossi Matias and Mario Szegedy. 1996. The space complexity of approximating the frequency moments. In Proceedings of the 28th Annual ACM Symposium on Theory of Computing, pp. 20–29.

Chakrabarti, Amit, Graham Cormode and Andrew McGregor. 2007. A near-optimal algorithm for computing the entropy of a stream. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 328–335.

Charikar, Moses, Kevin Chen and Martin Farach-Colton. 2002. Finding frequent items in data streams. In Proceedings of the 29th International Colloquium on Automata, Languages and Programming, pp. 693–702.

Cormode, Graham, and S. Muthukrishnan. 2005. What’s hot and what’s not: tracking most frequent items dynamically. ACM Transactions on Database Systems 30(1):249–278.

Demaine, Erik D., Alejandro López-Ortiz and J. Ian Munro. 2002. Frequency estimation of internet packet streams with limited space. In Proceedings of the 10th Annual European Symposium on Algorithms, pp. 348–360.

Dimitropoulos, Xenofontas, Paul Hurley and Andreas Kind. 2008. Probabilistic lossy counting: an efficient algorithm for finding heavy hitters. ACM SIGCOMM Computer Communication Review 38(1):7–16.

Estan, Cristian, and George Varghese. 2003. New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. ACM Transactions on Computer Systems 21(3):270–313.

Fischer, Michael J., and Steven L. Salzberg. 1982. Solution to problem 81–5. Journal of Algorithms 3(4):375–379.

Karp, Richard M., Scott Shenker and Christos H. Papadimitriou. 2003. A simple algorithm for finding frequent elements in streams and bags. ACM Transactions on Database Systems 28(1):51–55.

Manku, Gurmeet Singh, Sridhar Rajagopalan and Bruce G. Lindsay. 1998. Approximate medians and other quantiles in one pass and with limited memory. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pp. 426–435.

Manku, Gurmeet Singh, and Rajeev Motwani. 2002. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases, pp. 346–357.

Metwally, Ahmed, Divyakant Agrawal and Amr El Abbadi. 2008. Why go logarithmic if we can go linear? Towards effective distinct counting of search traffic. In Proceedings of the 11th International Conference on Extending Database Technology: Advances in Database Technology, pp. 618–629.

Computing Science

The Britney Spears Problem

Brian Hayes

Tracking who's hot and who's not presents an algorithmic challenge

Back in 1999, the operators of the Lycos Internet portal began publishing a weekly list of the 50 most popular queries submitted to their Web search engine. Britney Spears—initially tagged a "teen songstress," later a "pop tart"—was No. 2 on that first weekly tabulation. She has never fallen off the list since then—440 consecutive appearances when I last checked. Other perennials include Pamela Anderson and Paris Hilton. What explains the enduring popularity of these celebrities, so famous for being famous? That's a fascinating question, and the answer would doubtless tell us something deep about modern culture. But it's not the question I'm going to take up here. What I'm trying to understand is how we can know Britney's ranking from week to week. How are all those queries counted and categorized? What algorithm tallies them up to see which terms are the most frequent?

One challenging aspect of this task is simply coping with the volume of data. Lycos reports processing 12 million queries a day, and other search engines, such as Google, handle orders of magnitude more. But that's only part of the problem. What makes the counting difficult is that you can't just pay attention to a few popular subjects, because you can't know in advance which ones are going to rank near the top. To be certain of catching every new trend as it unfolds, you have to monitor all the incoming queries—and their variety is unbounded.

In the past few years the tracking of hot topics has itself become a hot topic in computer science. Algorithms for such tasks have a distinctive feature: They operate on a continuous and unending stream of data, rather than waiting for a complete batch of information to be assembled. Like a worker tending a conveyor belt, the algorithm has to process each element of the stream in sequence, as soon as it arrives. Ideally, all computations on one element are finished before the next item comes along.

Much of the new interest in stream algorithms is inspired by the Internet, where streams of many kinds flow copiously. It's not just a matter of search-engine popularity contests. A similar algorithm can help a network manager monitor traffic patterns, revealing which sites are generating most of the volume. The routers and switches that actually direct the traffic also rely on stream algorithms, passing along each packet of data before turning to the next. And services that filter spam from e-mail can use stream algorithms to detect messages sent in thousands or millions of identical copies.

Apart from the Internet, stream algorithms are also being applied to flows of financial data, such as stock-market transactions and credit-card purchases. The designers of software for some large-scale scientific experiments likewise adopt a stream-oriented style of data processing: Detectors at particle-physics laboratories produce so much data that no machine can store it all, even temporarily, and so preliminary filtering is done by programs that analyze signals on the fly. Finally, if some government agency wanted to monitor large numbers of telephone calls, they too might have an interest in stream algorithms.

Brian Hayes is senior writer for American Scientist. A collection of his essays, Group Theory in the Bedroom, and Other Mathematical Diversions, was published in April by Hill and Wang. Additional material related to the "Computing Science" column appears in Hayes's Weblog at http://bit-player.org. Address: 211 Dacian Avenue, Durham, NC 27701. Internet: brian@bit-player.org. © 2008 Brian Hayes. Reproduction with permission only. Contact bhayes@amsci.org. American Scientist, Volume 96, 2008 July–August.

Stream Gauges

Search-engine queries and Internet packets are fairly complicated data structures, but the principles of stream algorithms can be illustrated with simple streams of numbers. Suppose a pipeline delivers an endless sequence of nonnegative integers at a steady rate of one number every t time units. We want to build a device—call it a stream gauge—that intercepts the stream and displays answers to certain questions about the numbers.

What computational resources does a stream gauge need to accomplish its task? The resources of particular concern are processing time—which can't exceed t units per stream element—and auxiliary storage space. (It's convenient to assume that numbers of any size can be stored in one unit of space, and that any operation of ordinary arithmetic, such as addition or multiplication, can be performed in one unit of time. Comparing two numbers also takes a single time unit.) In effect, the algorithms can do calculations of any length on a single sheet of scratch paper.

For some questions about number streams, it's quite easy to design an effective stream gauge. If you want to know how many stream elements have been received, all you need is a counter. The counter is a storage location, or register, whose initial value is set to 0. Each time an element of the stream arrives, the counter is incremented: In other words, if the current value of the counter is x, it becomes x+1.

Instead of counting the stream elements, another kind of gauge displays the sum of all the integers received. Again we need a register x initialized to 0; then, on the arrival of each integer n, we add n to the running total in x. Still another easy-to-build gauge shows the maximum value of all the integers seen so far. Yet again x begins at 0, and whenever a stream element n is greater than x, that n becomes the new x.

Certain other stream functions require only slightly greater effort. For example, calculating the average (or arithmetic mean) of a stream's integers takes three registers and three operations per integer. One register counts the elements, another records their sum, and the third register holds the value of the average, calculated as the quotient of the sum and the count.

What matters most about the frugal memory consumption of these algorithms is not the exact number of registers needed; what matters is that the storage space remains constant no matter how long the stream. The algorithms are said to run in constant space. Their time performance follows a similar rule: They use the same number of clock cycles for each element of the stream, so that the processing of each element is completed in a single unit of time. The algorithms run in constant time. Computing doesn't get much easier than that.

[Figure caption: The Lycos 50 rankings of popular Web search terms have their ups and downs, but they also reveal an extraordinary persistence of interest in a small constellation of starlets and other pop-culture figures, as well as entertainments such as poker and professional wrestling (WWE). The graph records the trajectories of a few subjects that have made the list almost every week throughout the eight-year history of the rankings. The time span shown is an 87-week period from September 2006 to May 2008; week numbers count from the first survey in 1999; the two gray bars mark holiday weeks when no rankings were published. The logarithmic scale is meant to more clearly delineate the highest-ranking categories. (Psychologically, the difference between first and second is greater than the difference between 31st and 32nd.) Data are from http://50.lycos.com/]

Not every property of a stream can be computed in constant space and time. Consider a device that reports the number of distinct elements in a stream of integers. (For example, although the stream 3, 1, 8, 3, 1, 4, 3 has a total of seven elements, it has only four distinct elements: 3, 1, 8 and 4.) There's a straightforward method for counting distinct elements: Start a counter at 0 and increment it whenever the stream delivers a number you haven't seen before. But how do you know that a given number is making its first appearance? The obvious answer is to keep a list of all the integers you've encountered so far, and compare each incoming value n with the numbers in the stored set. If n is already present in the list, ignore the new copy; if n is not found, append it to the stored set and increment the counter. The trouble is, the set of stored values can grow without limit. In the worst case—where every element of the stream is unique—the size of the stored set is equal to the length of the stream. No fixed amount of memory can satisfy this appetite, so this is not a constant-space algorithm.

The problem of counting distinct elements is closely related to the problem of identifying the most-frequent elements in a stream, and having a solution to one problem gives us a head start on the other. The frequent-elements algorithm needs the same list, but instead of simply noting whether or not an element is present, the program must maintain a counter for each element, giving its frequency. Since the list of elements and their counters can grow without limit, demands for space are again potentially unbounded.

Both of the algorithms could run out of time as well as space. If it's necessary to search through all the stored numbers one by one for every newly received stream element, then the time needed grows in proportion to the size of the stored set. (Other data structures, such as trees and indexed arrays, allow for quicker access, but the time per element can still grow, depending on details of implementation.) Eventually the delay must exceed t, the interval between stream elements, at which point the algorithm can no longer keep up with the incoming data.
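The constant-space gauges are easy to sketch in code. Here is one rendering in Python (an illustrative sketch of my own, not anything from the published literature): a single pass updates four registers, with constant work per element.

```python
# A sketch of the constant-space "stream gauges": count, sum, maximum
# and running mean, each updated in constant time per element.
# Function and variable names are illustrative.

def stream_gauges(stream):
    """Yield (count, total, maximum, mean) after each stream element."""
    count = 0        # register for the number of elements seen
    total = 0        # register for their running sum
    maximum = 0      # register for the largest element so far
    for n in stream:
        count += 1
        total += n
        if n > maximum:
            maximum = n
        # The mean is the quotient of the sum and the count.
        yield count, total, maximum, total / count

# Example: the seven-element stream used in the text.
for snapshot in stream_gauges([3, 1, 8, 3, 1, 4, 3]):
    last = snapshot
# last holds (7, 23, 8, 23/7): seven elements, sum 23, maximum 8
```

Whatever the length of the stream, only three registers of state are carried forward; the mean is recomputed from them at each step.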

But before we grow too despondent about this bleak outlook, I should point out that there are also a few pleasant surprises in the world of stream algorithms.

Majority Rules

Although identifying the most frequent item in a stream is hard in general, there is an ingenious way of doing it in one special case—namely, when the most common item is so popular that it accounts for a majority of the stream entries (more than half the elements). The algorithm that accomplishes this task requires just two registers, and it runs in a constant amount of time per stream element. (Before reading on you might want to try constructing such an algorithm for yourself.)

The majority-finding algorithm uses one of its registers for temporary storage of a single item from the stream; this item is the current candidate for majority element. The second register is a counter initialized to 0. For each element of the stream, we ask the algorithm to perform the following routine: If the counter reads 0, install the current stream element as the new majority candidate (displacing any other element that might already be in the register). Then, if the current element matches the majority candidate, increment the counter; otherwise, decrement the counter. At this point in the cycle, if the part of the stream seen so far has a majority element, that element is in the candidate register, and the counter holds a value greater than 0.

What if there is no majority element? Without making a second pass through the data—which isn't possible in a stream environment—the algorithm cannot always give an unambiguous answer in this circumstance. It merely promises to correctly identify the majority element if there is one.

The majority algorithm was invented in 1982 by Michael J. Fischer and Steven L. Salzberg of Yale University.

[Figure caption: A stream algorithm must process its input on the fly, one element at a time, in sequence; thereafter the elements cannot be examined again. The example shown here is the computation of the average, or arithmetic mean. The stream elements are integers emitted from the source at the left; they flow to the right and are read as they pass by the averaging device. The algorithm counts the elements and sums them in two registers of auxiliary storage; the output is the quotient of these quantities, computed in a third register (red border). At all stages of the computation the output is the mean of all the integers in the segment of the stream seen so far.]
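The two-register routine can be written out in a few lines. This Python sketch follows the rules as stated above; the function name and the example stream are mine.

```python
# A sketch of the two-register majority algorithm. It returns the
# majority element if one exists; when no element has a true majority,
# the returned candidate carries no guarantee.

def majority_candidate(stream):
    candidate = None   # register 1: current majority candidate
    counter = 0        # register 2: counter, initialized to 0
    for element in stream:
        if counter == 0:
            candidate = element  # install a new candidate
        if element == candidate:
            counter += 1         # element matches the candidate
        else:
            counter -= 1         # element disagrees with the candidate
    return candidate, counter

# 4 appears four times in these seven elements, a strict majority.
cand, count = majority_candidate([6, 6, 2, 4, 4, 4, 4])
# cand is 4, and the counter ends above 0
```

Note that the counter does not report the candidate's actual frequency; it only certifies that, if a majority element exists, this is it.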
The version I have described here comes from a 2002 article by Erik D. Demaine of MIT and Alejandro López-Ortiz and J. Ian Munro of the University of Waterloo in Canada. Demaine and his colleagues have extended the algorithm to cover a more general problem: Given a stream of length n, identify a set of size m that includes all the elements occurring with a frequency greater than n/(m+1). (In the case of m = 1, this reduces to the majority problem.)

The extended algorithm requires m registers for the candidate elements as well as m counters. The basic scheme of operation is analogous to that of the majority algorithm. When a stream element matches one of the candidates, the corresponding counter is incremented; when there is no match to any candidate, all of the counters are decremented; and if a counter is at 0, the associated candidate is replaced by a new element from the stream. Again the results carry only a weak guarantee: If any elements of the stream exceed the threshold frequency, those elements will appear among the candidates, but not all the candidates are necessarily above the threshold.

Even with this drawback, the algorithm performs impressive feats, such as scanning a stream of Web search queries for all terms that make up at least 1 percent of the traffic. Streams of any length can be handled with only a fixed amount of memory.
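A compact Python sketch of the extended candidate-and-counter scheme follows. A dictionary stands in for the m candidate registers and their counters; in this rendering (a common textbook variant, sometimes attributed to Misra and Gries), a candidate whose counter reaches 0 frees its register, and the next unmatched element is installed in a free register. All names are mine.

```python
# A sketch of the m-candidate frequent-elements algorithm: the returned
# set is guaranteed to contain every element occurring more than
# n/(m+1) times in a stream of length n, though it may also contain
# elements below that threshold.

def frequent_candidates(stream, m):
    counters = {}  # candidate element -> counter (at most m entries)
    for element in stream:
        if element in counters:
            counters[element] += 1        # match: increment its counter
        elif len(counters) < m:
            counters[element] = 1         # free register: install candidate
        else:
            for cand in list(counters):   # no match: decrement all counters
                counters[cand] -= 1
                if counters[cand] == 0:   # a zeroed candidate is evicted,
                    del counters[cand]    # freeing its register
    return set(counters)

stream = [1, 1, 1, 2, 2, 3, 4]   # n = 7; with m = 2 the threshold is 7/3
cands = frequent_candidates(stream, 2)
# 1 occurs 3 > 7/3 times, so 1 is guaranteed to be among the candidates
```

Only m registers and m counters are ever in use, so the space is fixed regardless of the stream's length, as the text promises.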

Getting It Almost Right

Stream algorithms that require more than a constant amount of storage space are seldom of practical use in large-scale applications. For many stream problems of practical interest, computing exact answers is simply not feasible—but it's also not necessary. A good estimate, or an answer that's probably correct, will serve just fine. A great deal of ingenuity has been applied to the search for better approximate stream algorithms, and there are successes to report. Here I shall briefly mention work in two areas, based on sampling and on hashing.

One simple approach to approximation is to break a stream into blocks, turning a continuous process into a series of batch operations. Within a block, you are not limited to algorithms that obey the one-pass-only rule. On the other hand, answers derived from a sequence of blocks only approximate answers for the stream as a whole. For example, in a count of distinct elements, an item counted in two successive blocks would be counted only once if the blocks were combined. A variant of the block idea is the sliding window: A window of length k holds the most recent k elements from the stream, with each new arrival displacing the oldest item in the queue.

Gently Down the Stream

When the items that interest you are the most frequent ones in a stream, statistics is on your side: Any sample drawn from the stream is most likely to include those very items. Indeed, any random sample represents an approximate solution to the most-common-elements problem. (A further complication is that the output is not a single number but an entire data structure.) But the simplest sampling strategy is not necessarily the most efficient or the most accurate.

Suppose you want to select a representative sample of 100 items from a stream. If the stream consists of just 100 elements, the task is easy: Take all the elements—in other words, select them with probability 1. When the stream extends to 200 elements, you can make a new random selection with probability 1/2; at 400 elements, the correct probability is 1/4; and so on.

In 2002 Gurmeet Singh Manku and Rajeev Motwani of Stanford University described two methods of this kind, which they called sticky sampling and lossy counting. Their sticky-sampling method refines the selection each time the stream doubles in length: Rather than starting over with a new sample, Manku and Motwani propose a scheme for continually readjusting the selection probability, maintaining counters that estimate how often each item appears in the full stream. The answers are guaranteed, but only in a probabilistic sense; if you run the algorithm twice on the same data, the results will likely differ.

[Figure caption: Unbounded memory requirements make some stream algorithms intractable. Counting the number of distinct elements in a stream seems easy at first: Just increment a counter on the first appearance of each element. But knowing whether or not an element is new requires keeping a record of those already seen, and that record can grow without limit; the output is the length of the list. An algorithm for identifying the most frequent items in a stream is similar but includes a counter for each item; each new element is added to the front of a list, and in the version shown here the lists of items and counters are kept sorted with the most frequent items first.]
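The doubling-and-halving schedule for the selection probability can be sketched as follows. This is my own illustrative rendering of the idea, not Manku and Motwani's published procedure: instead of drawing a fresh sample at each doubling, items already in the sample survive a coin flip, so the selection probability is readjusted without starting over.

```python
import random

# A sketch of an adaptive sampling schedule: the selection probability
# is halved each time the stream doubles in length, and the items
# already in the sample are thinned by a coin flip rather than
# resampled from scratch. Parameters and names are illustrative.

def adaptive_sample(stream, target_size=100, seed=0):
    rng = random.Random(seed)
    sample = []
    p = 1.0                      # current selection probability
    threshold = target_size      # stream length at which p is next halved
    for i, element in enumerate(stream, start=1):
        if i > threshold:        # the stream has doubled:
            p /= 2               # halve the selection probability
            threshold *= 2
            # thin the existing sample with a fair coin flip per item
            sample = [x for x in sample if rng.random() < 0.5]
        if rng.random() < p:
            sample.append(element)
    return sample

s = adaptive_sample(range(1000), target_size=100)
# len(s) hovers around 100, though it fluctuates from run to run
```

The expected sample size stays near the target at every doubling, which is the point of the trick: the sample is representative of the whole stream seen so far, yet its storage never grows with the stream.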

Lossy counting, the second of Manku and Motwani's methods, is based on a similar idea of continually refining a sample as the stream lengthens, but in this case the algorithm is deterministic, making no use of random choices. Each stream element is checked against a stored list; if the element is already present, a counter is incremented; otherwise a new entry is added. To keep the list from growing uncontrollably, it is periodically purged of infrequent elements. Lossy counting is not guaranteed to run in constant space, but Manku and Motwani report that in practice it performs better than sticky sampling.

Making A Hash of It

Imagine a stream composed of numbers chosen at random from some fixed range of integers, say 1 through n. A completely random stream might seem hard to analyze, but in fact randomness opens the door to some simple but powerful statistical methods. For this stream the number of distinct elements observed follows a highly predictable course; in particular, all n distinct elements should have appeared at least once by the time the stream reaches length n log n.

This kind of analysis won't work for most streams of real data, because the elements are not drawn from a convenient fixed range, and they are not random—unless you compose search-engine queries by blindly pecking at the keyboard. This is where hashing comes in. Hashing—one of the more aptly named techniques of computer science—is a way of scrambling data so that the result looks random. A hash function transforms a data item into a more-or-less random value uniformly distributed over some interval. Think of hashing as the action of a mad postal clerk sorting letters into bins. The clerk's rule says that if two letters have the same address, they go into the same bin, but in all other respects the sorting might as well be random. Oddly enough, this turns out to be a good way of making sense of the data.

By randomly distributing all the elements of a stream into a fixed number of bins, hashing allows for easy statistical analysis of the stream's elements. For example, the number of distinct elements can be estimated from the probability that a given bin remains empty, or from the minimum number of elements found in any bin. A recent paper by Ahmed Metwally and Divyakant Agrawal of Ask.com and Amr El Abbadi of the University of California, Santa Barbara, evaluates a dozen of these algorithms.

[Figure caption: An algorithm that uses just two memory registers can identify the most common element of a stream in the special case where that value makes up more than half the elements. The rules state that if the counter reads 0, the current element should be installed in the candidate register; then, if the current element matches the candidate, increment the counter; otherwise decrement it. In the example worked out here, the device correctly shows that 6 is a majority item for the first three elements of the stream, and 4 is a majority of the first seven elements.]

Are there better algorithms for stream problems waiting to be discovered? Some of the limits of what's possible were outlined in 1996 by Noga Alon, Yossi Matias and Mario Szegedy, who were all then at AT&T Bell Laboratories. They framed the issue in terms of an infinite series of frequency moments, analogous to the more familiar statistical moments (mean, variance and so on).
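The empty-bin estimate is easy to try out. In the sketch below (the hash function, the bin count and all names are my own choices; this scheme is often called linear counting in the literature), every element is hashed into one of B bins, and the count of distinct elements is recovered from the fraction of bins left empty.

```python
import math
import hashlib

# A sketch of the empty-bin estimate for distinct elements: hash each
# element into one of num_bins bins and invert the expected fraction
# of empty bins. Only the bin-occupancy bits are stored, so the space
# is fixed no matter how long the stream.

def estimate_distinct(stream, num_bins=1024):
    occupied = [False] * num_bins
    for element in stream:
        digest = hashlib.sha256(str(element).encode()).digest()
        bin_index = int.from_bytes(digest[:8], "big") % num_bins
        occupied[bin_index] = True
    empty_fraction = occupied.count(False) / num_bins
    if empty_fraction == 0:
        return float(num_bins)  # every bin was hit; the estimate saturates
    # If d distinct items land uniformly in B bins, a given bin stays
    # empty with probability about e^(-d/B), so d ~ -B * ln(empty fraction).
    return -num_bins * math.log(empty_fraction)

# A stream of 700 elements with 350 distinct values:
est = estimate_distinct(list(range(350)) * 2, num_bins=1024)
# est lands reasonably close to 350
```

Note that repetitions cost nothing: hashing the same element twice marks the same bin, so the estimate depends only on the distinct values, exactly the property the counting problem demands.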

All of the frequency moments are defined as sums of powers of mi, where mi is the number of times that item i appears in a stream, and the sum is taken over all possible values of i. The zero-order frequency moment F0 is the number of distinct elements in the stream; the first-order moment F1 is the total number of elements; F2, the sum of the squares of the mi, is a quantity called the repeat rate. Still higher moments give greater emphasis to the most-frequent elements and describe the "skew" of the stream.

We can determine the exact length of a stream—the moment F1—with a simple, deterministic algorithm that runs in constant space; it's interesting that calculating F1 is so easy, because none of the other moments are. We can get an exact value of F0 (the number of distinct elements) only by supplying much more memory. Approximating F0 in constant space requires a nondeterministic algorithm, one that makes essential use of randomness. The same is true of F2. And for the higher moments, F6 and above, Matias and Szegedy proved that no such algorithm exists; there are no constant-space methods at all.

All this mathematics and algorithmic engineering seems like a lot of work just for following the exploits of a famous "pop tart." But I like to think the effort might be justified. Years from now, someone will type "Britney Spears" into a search engine and will stumble upon this article listed among the results. Perhaps then a curious reader will be led into new lines of inquiry.

Bibliography

Alon, Noga, Yossi Matias and Mario Szegedy. 1996. The space complexity of approximating the frequency moments. In Proceedings of the 28th Annual ACM Symposium on Theory of Computing, pp. 20–29.

Chakrabarti, Amit, Graham Cormode and Andrew McGregor. 2007. A near-optimal algorithm for computing the entropy of a stream. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 328–335.

Charikar, Moses, Kevin Chen and Martin Farach-Colton. 2002. Finding frequent items in data streams. In Proceedings of the 29th International Colloquium on Automata, Languages and Programming, pp. 693–702.

Cormode, Graham, and S. Muthukrishnan. 2005. What's hot and what's not: tracking most frequent items dynamically. ACM Transactions on Database Systems 30(1):249–278.

Demaine, Erik D., Alejandro López-Ortiz and J. Ian Munro. 2002. Frequency estimation of internet packet streams with limited space. In Proceedings of the 10th Annual European Symposium on Algorithms, pp. 348–360.

Dimitropoulos, Xenofontas, Paul Hurley and Andreas Kind. 2008. Probabilistic lossy counting: an efficient algorithm for finding heavy hitters. ACM SIGCOMM Computer Communication Review 38(1):7–16.

Estan, Cristian, and George Varghese. 2003. New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. ACM Transactions on Computer Systems 21(3):270–313.

Fischer, Michael J., and Steven L. Salzberg. 1982. Solution to problem 81–5. Journal of Algorithms 3(4):375–379.

Karp, Richard M., Christos H. Papadimitriou and Scott Shenker. 2003. A simple algorithm for finding frequent elements in streams and bags. ACM Transactions on Database Systems 28(1):51–55.

Manku, Gurmeet Singh, and Rajeev Motwani. 2002. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases, pp. 346–357.

Manku, Gurmeet Singh, Sridhar Rajagopalan and Bruce G. Lindsay. 1998. Approximate medians and other quantiles in one pass and with limited memory. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pp. 426–435.

Metwally, Ahmed, Divyakant Agrawal and Amr El Abbadi. 2008. Why go logarithmic if we can go linear? Towards effective distinct counting of search traffic. In Proceedings of the 11th International Conference on Extending Database Technology: Advances in Database Technology, pp. 618–629.
