Feeding Frenzy: Selectively Materializing Users' Event Feeds
Adam Silberstein¹, Jeff Terrace², Brian F. Cooper¹, Raghu Ramakrishnan¹
¹ Yahoo! Research, Santa Clara, CA, USA, {silberst,cooperb,ramakris}@yahoo-inc.com
² Princeton University, Princeton, NJ, USA, jterrace@cs.princeton.edu
ABSTRACT
Near real-time event streams are becoming a key feature of many popular web applications. Many web sites allow users to create a personalized feed by selecting one or more event streams they wish to follow. Examples include Twitter and Facebook, which allow a user to follow other users' activity, and iGoogle and My Yahoo, which allow users to follow selected RSS streams. How can we efficiently construct a web page showing the latest events from a user's feed? Constructing such a feed must be fast so the page loads quickly, yet the feed must reflect recent updates to the underlying event streams. The wide fanout of popular streams (those with many followers) and high skew (fanout and update rates vary widely) make it difficult to scale such applications.

We associate feeds with consumers and event streams with producers. We demonstrate that the best performance results from selectively materializing each consumer's feed: events from high-rate producers are retrieved at query time, while events from lower-rate producers are materialized in advance. A formal analysis of the problem shows the surprising result that we can minimize global cost by making local decisions about each producer/consumer pair, based on the ratio between a given producer's update rate (how often an event is added to the stream) and a given consumer's view rate (how often the feed is viewed). Our experimental results, using Yahoo!'s web-scale database PNUTS, show that this hybrid strategy results in the lowest system load (and hence improves scalability) under a variety of workloads.
Categories and Subject Descriptors: H.2.4 [Systems]: Distributed databases
General Terms: Performance
Keywords: social networks, view maintenance
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGMOD'10, June 6–11, 2010, Indianapolis, Indiana, USA.
Copyright 2010 ACM 978-1-4503-0032-2/10/06 ...$10.00.
1. INTRODUCTION
Internet users like to be kept up-to-date with what is going on. Social networking sites like Twitter and Facebook provide a "feed" of status updates, posted photos, movie reviews, etc., generated by a user's friends. Content aggregators like My Yahoo and iGoogle let users create a customized feed aggregating multiple RSS sources. Similarly, news aggregators like Digg and Reddit provide a feed of the latest stories on topics like "entertainment" and "technology," while news sites like CNN.com provide the ability to follow fine-grained topics like "health care debate."

Each of these examples is a follows application: a user follows one or more interests, where an interest might be another user, content category or topic. In this paper, we formalize this important class of applications as a type of view materialization problem. We introduce the abstraction of a producer, which is an entity that generates a series of time-ordered, human-readable events for a particular followable interest. Thus, a producer might be a friend, a website, or aggregator of content on a particular topic collected from multiple sources. The goal of a follows application is to produce a "feed" for a user, which is a combined list of the latest events across all of the producers a user is following. For example, a feed might combine recent status updates from all of the user's friends on a social site, or recent stories on all of the user's topics on a content aggregation site. In some cases a user wants a combined feed, including both social and topic updates. An important point to keep in mind for optimization purposes is that we need only show the most recent events (specified in terms of a window of time or number of events) when a consumer checks his feed.

Follows applications are notoriously difficult to scale. The application must continually keep up with a high throughput of events. Twitter engineers have famously described re-architecting Twitter's back-end multiple times to keep up with rapid increases in throughput as the system became more popular (see for example [26]). At the same time, users expect their feed page to load fast, which means latency must be strictly bounded. This often means extensive materialization and caching, with associated high capital and operations expenditure. For example, Digg elected to denormalize and materialize a large amount of data to reduce latency for their "green badge" application (e.g., follow which stories my friends have dugg), resulting in a blow-up of stored data from tens of GB to 3 TB [10].

There are several reasons why such applications are hard to scale. First, events fan out, resulting in a multiplicative effect on system load. Whenever Ashton Kutcher "tweets," his status update is propagated to over 4.6 million followers (as of March 2010). Even a considerably lower average fanout can cause severe scaling problems. Second, the fanouts and update rates have high skew across producers, making it difficult to choose an appropriate strategy. Facebook, for example, reportedly employs different feed materialization strategies for wide-fanout users like bands and politicians, compared to the majority of users who have much narrower fanout.

In this paper, we present a platform that we have built for supporting follows applications. Consider a producer of events (such as a user's friend, or news source) and a consumer of events (such as the user himself). There are two strategies for managing events:
• push: events are pushed to materialized per-consumer feeds
• pull: events are pulled from a per-producer event store at query time (e.g., when the consumer logs in)

If we regard each consumer's feed as a "most recent" window query, we can think of these options in traditional database terms as a "fully materialize" strategy versus a "query on demand" strategy. Sometimes, push is the best strategy, so that when users log in, their feed is pre-computed, reducing system load and latency. In contrast, if the consumer logs in infrequently compared to the rate at which the producer is producing events, the pull strategy is best. Since we only need to display the most recent events, it is wasteful to materialize lots of events that will later be superseded by newer events before the consumer logs in to retrieve them.

In our approach, a consumer's feed is computed using a combination of push and pull, to handle skew in the event update rate across producers: a particular user that logs in once an hour may be logging in more frequently than one producer's rate (so push is best) and less frequently than another producer's rate (so pull is best). A key contribution of this paper is the theoretical result that making local push/pull decisions on a producer/consumer basis minimizes total global cost. This surprising result has great practical implications, because it makes it easier to minimize overall cost, and straightforward to adapt the system when rates or fanouts change. Our experiments and our experience with a live follows application show that this approach scales better than a purely push or purely pull system across a wide range of workloads. Furthermore, this local optimization approach effectively allows us to cope with flash loads and other sudden workload changes simply by changing affected producer/consumer pairs from push to pull. Thus, our techniques can serve as the basis for a general-purpose follows platform that supports several instances of the follows problem with widely varying characteristics.

The follows problem is similar to some well-studied problems in database systems. For example, the "materialize or not" question is frequently explored in the context of index and view selection [13, 4]. In our context, the question is not which views are helpful, but which of a large number of consumer feed views to materialize. There has been work on partially materialized views [16] and indexes [27, 24, 25] and we borrow some of those concepts (e.g., materializing frequently accessed data). In contrast to this previous work, we show that it is not possible to make a single "materialize or not" decision for a given base tuple (producer event) in the follows problem setting; instead, we need to make that decision for each producer/consumer pair based on their relative event and query rates. Other work in view maintenance and query answering using views targets complex query workloads and aggregation, while our scenario is specialized to (a very large number of) most recent window queries over (a very large number of) streams. We review related work more thoroughly in Section 6.

In this paper, we describe an architecture and techniques for large scale follows applications.
We have implemented these techniques on top of PNUTS [9], a distributed, web-scale database system used in production at Yahoo!. In particular, we make the following contributions:
• A formal definition of the follows problem as a partial view materialization problem, identifying properties that must be preserved in follows feeds (Section 2).
• An analysis of the optimization problem of determining which events to push and which events to pull, in order to minimize system load (equivalently, maximize scalability) while providing latency guarantees. We establish the key result that making push/pull decisions on a local basis provides globally good performance (Section 3).
• Algorithms for selectively pushing events to a consumer feed to optimize the system's performance. Our algorithms leverage our theoretical result by making decisions on a per-producer/consumer basis. Further, these algorithms allow our system to effectively adapt to sudden surges in load (Section 4).
• An experimental study showing our techniques perform well in practice. The system chooses an appropriate strategy across a wide range of workloads (Section 5).
Additionally, we review related work in Section 6, and conclude in Section 7.
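The per-pair push/pull choice described in this section can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes only what the introduction states, namely that the choice for a producer/consumer pair depends on the producer's update rate relative to the consumer's view rate. The function names and the `threshold` parameter are hypothetical, and the default of 1.0 is a placeholder, not a derived value.

```python
from typing import Dict


def choose_strategy(update_rate: float, view_rate: float,
                    threshold: float = 1.0) -> str:
    """Pick 'push' or 'pull' for a single producer/consumer pair.

    update_rate: events per hour added by the producer.
    view_rate:   feed views per hour by the consumer.
    threshold:   hypothetical cutoff on update_rate / view_rate
                 (an assumption made for this sketch).
    """
    if view_rate <= 0:
        # A consumer who never looks at the feed gains nothing from
        # materialization, so defer all work to query time.
        return "pull"
    ratio = update_rate / view_rate
    return "push" if ratio <= threshold else "pull"


def plan_feed(view_rate: float, producer_rates: Dict[str, float],
              threshold: float = 1.0) -> Dict[str, str]:
    """Decide each followed producer independently (a purely local decision)."""
    return {
        producer: choose_strategy(rate, view_rate, threshold)
        for producer, rate in producer_rates.items()
    }


# A consumer who views the feed once an hour and follows one slow and
# one fast producer ends up with a hybrid plan.
plan = plan_feed(1.0, {"slow_blog": 0.1, "busy_newswire": 50.0})
print(plan)
```

Because each pair is decided independently, a change in one producer's rate affects only the pairs that involve that producer.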
2. PROBLEM DEFINITION
We now define the follows problem formally. First, we present the underlying data model, including consumers, producers, and the follows concept. Next, we examine user expectations about the resulting customized feed. Then, we define the core optimization problem we are addressing.
2.1 Follows data and query model
A follows application consists of a set of consumers (usually, users) who are following the event streams generated by a set of producers. Each consumer chooses the producers they wish to follow. In this paper, a producer generates a named sequence of human-readable, timestamped events. Examples of producers are "Ashton Kutcher's tweets" or "news stories about global warming." In general, a producer may be another user, a website (such as a news site or blog) or an aggregator of events from multiple sources. We treat each followable "topic" (such as "global warming" or "health care debate") as a separate producer in our discussion, even if the content for different topics comes from the same website or data source. Events are usually ordered by the time they were created (although other orderings are possible).

We define the connection network as a directed graph $G(V, F)$, where each vertex $v_i \in V$ is either a consumer or a producer, and there is a follows edge $f_{ij} \in F$ from a consumer vertex $c_i$ to a producer vertex $p_j$ if $c_i$ follows $p_j$ (i.e., $c_i$ consumes $p_j$'s events). Social networks are one example of a type of connection network. Other examples include people following each other on Twitter, status and location updates for friends, customized news feeds, and so on. While these instances share a common problem formulation, the specifics of scale, update rates, skew, etc. vary widely, and a good optimization framework is required to build a robust platform.

We can think of the network as a relation ConnectionNetwork(Producer, Consumer). Typically, the connection network is stored explicitly in a form that supports efficient lookup by producer, consumer or both. For example, to push an event for producer $p_j$ to interested consumers, we must look up $p_j$ in the connection network and retrieve the set of consumers following that producer, which is $\{c_i : f_{ij} \in F\}$. In contrast, to pull events for a consumer $c_i$, we need to look up $c_i$ in the network and retrieve the set of producers for that consumer, which is $\{p_j : f_{ij} \in F\}$. In this latter case, we may actually define the relation as ConnectionNetwork(Consumer, Producer) to support clustering by Consumer. If we want to support both access paths (via producer and via consumer), we must build an index in addition to the ConnectionNetwork relation.

Each producer $p_j$ generates a stream of events, which we can model as a producer events relation $P_j$(EventID, Timestamp, Payload) (i.e., there is one relation per producer). When we want to show a user his feed, we must execute a feed query over the $P_j$ relations. There are two possibilities for the feed query. The first is that the consumer wants to see the most recent $k$ events across all of the producers he follows. We call this option global coherency and define the feed query as:

Q1. $\sigma_{(k \text{ most recent events})} \bigcup_{j : f_{ij} \in F} P_j$

A second possibility is that we want to retrieve $k$ events per producer, to help ensure diversity of producers in the consumer's view. We call this option per-producer coherency and define the feed query as:

Q2. $\bigcup_{j : f_{ij} \in F} \sigma_{(k \text{ most recent events})} P_j$

Further processing is needed to then narrow down the per-producer result to a set of $k$ events, as described in the next section. We next examine the question of when we might prefer global or per-producer coherency.
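To make the data model concrete, the following sketch keeps the connection network in memory with both access paths (by producer and by consumer) and evaluates the two feed queries over per-producer event lists. All class and function names are hypothetical, the storage layout is an assumption for illustration, and it does not reflect how the system is implemented on PNUTS. The further processing that narrows the Q2 result down to $k$ events is deliberately left out.

```python
import heapq
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set


class Event(NamedTuple):
    # Timestamp comes first so that plain tuple comparison orders events
    # by time; the relation itself is P_j(EventID, Timestamp, Payload).
    timestamp: int
    event_id: str
    payload: str


class ConnectionNetwork:
    """Follows edges, indexed by consumer and by producer."""

    def __init__(self) -> None:
        self.producers_of: Dict[str, Set[str]] = defaultdict(set)
        self.consumers_of: Dict[str, Set[str]] = defaultdict(set)

    def follow(self, consumer: str, producer: str) -> None:
        self.producers_of[consumer].add(producer)   # access path for pull
        self.consumers_of[producer].add(consumer)   # access path for push


def feed_global(net: ConnectionNetwork, streams: Dict[str, List[Event]],
                consumer: str, k: int) -> List[Event]:
    """Q1: the k most recent events over the union of followed streams."""
    followed = net.producers_of.get(consumer, set())
    merged = (e for p in followed for e in streams.get(p, []))
    return heapq.nlargest(k, merged)


def feed_per_producer(net: ConnectionNetwork, streams: Dict[str, List[Event]],
                      consumer: str, k: int) -> List[Event]:
    """Q2: the union of the k most recent events of each followed stream."""
    followed = net.producers_of.get(consumer, set())
    result: List[Event] = []
    for p in followed:
        result.extend(heapq.nlargest(k, streams.get(p, [])))
    return sorted(result, reverse=True)


net = ConnectionNetwork()
net.follow("carol", "alice")
net.follow("carol", "bob")
net.follow("dave", "alice")

streams = {
    "alice": [Event(1, "a1", "awake"), Event(3, "a2", "hungry"),
              Event(5, "a3", "had lunch")],
    "bob": [Event(2, "b1", "at work")],
}

q1 = feed_global(net, streams, "carol", k=2)
q2 = feed_per_producer(net, streams, "carol", k=1)
print([e.event_id for e in q1], [e.event_id for e in q2])
```

In this example Q1 returns only events from the more active producer, while Q2 returns one event from each followed producer, which is the diversity effect that motivates per-producer coherency.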
2.2 Consumer feeds
We now consider the properties of consumer feeds. A feed query is executed whenever the consumer logs on or refreshes their page. We may also automatically retrieve a consumer's updated feed, perhaps using Ajax, Flash or some other technology. The feed itself is a display of an ordered collection of events from one or more of the producers followed by the user. A feed typically shows only the $k$ most recent events, although a user can usually request more previous events (e.g., by clicking "next"). We identify several properties which capture users' expectations for their feed:
• Time-ordered: Events in the feed are displayed in timestamp order, such that for any two events $e_1$ and $e_2$, if Timestamp($e_1$) < Timestamp($e_2$), then $e_1$ precedes $e_2$ in the feed.¹
• Gapless: Events from a particular producer are displayed without gaps, i.e., if there are two events $e_1$ and $e_2$ from producer $p_j$, $e_1$ precedes $e_2$ in the feed, and there is no event from $p_j$ in the feed which succeeds $e_1$ but precedes $e_2$, then there is no event in $P_j$ with a timestamp greater than that of $e_1$ but less than that of $e_2$.
• No duplicates: No event $e_i$ appears twice in the feed.

¹ Note that many sites show recent events at the top of the page, so "preceded" in the feed means "below" when the feed is actually displayed.

When a user retrieves their feed twice, they have expectations about how the feed changes between the first and second retrieval. In particular, if they have seen some events in a particular order, they usually expect to see those events again. Consider for example a feed that contains $k = 5$ events and includes these events when retrieved at 2:00 pm:

Feed 1
Event   Time   Producer   Text
$e_4$   1:59   Alice      Alice had lunch
$e_3$   1:58   Chad       Chad is tired
$e_2$   1:57   Alice      Alice is hungry
$e_1$   1:56   Bob        Bob is at work
$e_0$   1:55   Alice      Alice is awake

At 2:02 pm, the user might refresh their feed page, causing a new version of the feed to be retrieved. Imagine in this time that two new events have been generated from Alice:

Feed 2
Event   Time   Producer   Text
$e_6$   2:01   Alice      Alice is at work
$e_5$   2:00   Alice      Alice is driving
$e_4$   1:59   Alice      Alice had lunch
$e_3$   1:58   Chad       Chad is tired
$e_2$   1:57   Alice      Alice is hungry

In this example, the two new Alice events resulted in the two oldest events ($e_0$ and $e_1$) disappearing, and the global ordering of all events across the user's producers is preserved. This is the global coherency property: the sequence of events in the feed matches the underlying timestamp order of all events from the user's producers, and event orders are not shuffled from one view of the feed to the next. This model is familiar from email readers that show emails in time order, and is used in follows applications like Twitter.

In some cases, however, global coherency is not desirable. Consider the previous example: in Feed 2, there are many Alice events and no Bob events. This lack of diversity results when some producers temporarily or persistently have higher event rates than other producers. To preserve diversity, we may prefer per-producer coherency: the ordering of events from a given producer is preserved, but no guarantees are made about the relative ordering of events between producers. Consider the above example again. When viewing the feed at 2:02 pm, the user might see:

Feed 2'
Event   Time   Producer   Text
$e_6$   2:01   Alice      Alice is at work
$e_5$   2:00   Alice      Alice is driving
$e_4$   1:59   Alice      Alice had lunch
$e_3$   1:58   Chad       Chad is tired
$e_1$   1:56   Bob        Bob is at work

This feed preserves diversity, because the additional Alice events did not result in the Bob events disappearing.
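The feed properties and the global coherency condition can be checked mechanically against the example feeds. The sketch below is illustrative only: the function names are hypothetical, timestamps are encoded as minutes since midnight for convenience, and feeds are lists in display order (newest first), which is an assumption about presentation rather than part of the definitions.

```python
from typing import List, NamedTuple


class Event(NamedTuple):
    minute: int       # timestamp as minutes since midnight
    event_id: str
    producer: str


def at(hour: int, minute: int) -> int:
    return hour * 60 + minute


# All events of the running example, oldest first.
E = [
    Event(at(13, 55), "e0", "Alice"),
    Event(at(13, 56), "e1", "Bob"),
    Event(at(13, 57), "e2", "Alice"),
    Event(at(13, 58), "e3", "Chad"),
    Event(at(13, 59), "e4", "Alice"),
    Event(at(14, 0), "e5", "Alice"),
    Event(at(14, 1), "e6", "Alice"),
]

feed_1 = [E[4], E[3], E[2], E[1], E[0]]
feed_2 = [E[6], E[5], E[4], E[3], E[2]]
feed_2_prime = [E[6], E[5], E[4], E[3], E[1]]


def time_ordered(feed: List[Event]) -> bool:
    """Newest-first display: timestamps never increase down the list."""
    return all(a.minute >= b.minute for a, b in zip(feed, feed[1:]))


def no_duplicates(feed: List[Event]) -> bool:
    ids = [e.event_id for e in feed]
    return len(ids) == len(set(ids))


def gapless(feed: List[Event], stream_events: List[Event]) -> bool:
    """No producer event is missing between two displayed events of that producer."""
    shown = {e.event_id for e in feed}
    for producer in {e.producer for e in feed}:
        times = [e.minute for e in feed if e.producer == producer]
        lo, hi = min(times), max(times)
        for e in stream_events:
            if (e.producer == producer and lo < e.minute < hi
                    and e.event_id not in shown):
                return False
    return True


def globally_coherent(feed: List[Event], stream_events: List[Event],
                      k: int) -> bool:
    """The feed is exactly the k most recent events across all producers."""
    return feed == sorted(stream_events, reverse=True)[:k]


for name, feed in [("Feed 2", feed_2), ("Feed 2'", feed_2_prime)]:
    print(name, time_ordered(feed), gapless(feed, E),
          no_duplicates(feed), globally_coherent(feed, E, 5))
```

Under these definitions both Feed 2 and Feed 2' are time-ordered, gapless, and duplicate-free, but only Feed 2 is globally coherent, since Feed 2' skips $e_2$ in order to keep Bob's event visible.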
