Announcements:
- HW4 due today!
- CS341 info session is on Thu 3/14, 6pm in Gates 415
http://cs246.stanford.edu
Uncertainty around the information need => don't put all your eggs in one basket
3/12/2013
Example headlines: "France intervenes", "Chuck for Defense", "Argo wins big", "Hagel expects fight", "New gun proposals"
Idea: encode diversity as a coverage problem. Example: given the word cloud of news for a single day, we want to select articles so that most words are covered.
[Word cloud: France, Mali, Hagel, Pentagon, Obama, Romney, Zero Dark Thirty, Argo, NFL]
[Figure: sets X1, X2, X4 covering elements of the universe]
Example (greedy selection):
Evaluate F({a}), F({b}), ..., and pick the best (say {a})
Evaluate F({a, b}), F({a, c}), ..., and pick the best (say {a, b})
Evaluate F({a, b, c}), F({a, b, d}), ..., and pick the best
And so on
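The greedy procedure above can be sketched in a few lines. Here F is word coverage, and the articles with their word sets are made up for illustration:

```python
def coverage(selected):
    """F(A): number of distinct words covered by the articles in A."""
    return len(set().union(*selected)) if selected else 0

def greedy(candidates, k):
    """Pick k articles, each time evaluating F on every extension."""
    chosen = []
    for _ in range(k):
        # Evaluate F(A ∪ {x}) for every remaining candidate x; keep the best.
        best = max(candidates, key=lambda x: coverage(chosen + [x]))
        chosen.append(best)
        candidates = [x for x in candidates if x is not best]
    return chosen

articles = [
    frozenset({"france", "mali"}),
    frozenset({"argo", "oscars", "france"}),
    frozenset({"hagel", "pentagon"}),
]
picked = greedy(articles, k=2)
print(coverage(picked))  # 5: all words except the overlap are covered
```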
Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu
Definition: A set function F(·) is called submodular if for all A, B ⊆ W:
F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B)
Diminishing returns characterization. Equivalent definition: a set function F(·) is called submodular if for all A ⊆ B and s ∉ B:
F(A ∪ {s}) − F(A) ≥ F(B ∪ {s}) − F(B)
Adding element s to the smaller set A helps at least as much as adding s to the superset B.
F(·) is submodular: the gain from adding Xd to A is at least the gain from adding Xd to B ⊇ A.
Natural example:
Sets X1, …, Xn; F(S) = |∪_{i∈S} Xi| (size of the covered area)
Claim: F(S) is submodular!
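A quick numeric check of the claim on a small made-up instance (illustrative, not a proof): for every A ⊆ B and every d outside B, adding d to A gains at least as much coverage as adding d to B.

```python
from itertools import combinations

# Made-up collection of sets for the coverage function F(S) = |union|.
X = {1: {1, 2, 3}, 2: {3, 4}, 3: {4, 5, 6}, 4: {1, 6}}

def F(S):
    return len(set().union(*(X[i] for i in S))) if S else 0

def gain(S, d):
    """Marginal gain of adding set d to the collection S."""
    return F(S | {d}) - F(S)

# Exhaustively verify diminishing returns: gain(A, d) >= gain(B, d) for A ⊆ B.
ok = True
for B_size in range(len(X) + 1):
    for B in combinations(X, B_size):
        B = set(B)
        for A_size in range(len(B) + 1):
            for A in combinations(B, A_size):
                A = set(A)
                for d in set(X) - B:
                    ok &= gain(A, d) >= gain(B, d)
print(ok)  # True: coverage satisfies diminishing returns
```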
[Plot: F(A) as a function of |A|; coverage grows with diminishing returns]
[Word cloud for inauguration weekend: France, Mali, Hagel, Pentagon, Obama, Romney, Zero Dark Thirty, Argo, NFL]
Objective: maximize weighted coverage, F(A) = Σ_c w_c · cover_A(c), where the w_c are concept (word) weights.
Greedy:
Marginal gain of document x: F(A ∪ {x}) − F(A)
Add the document with the highest marginal gain
[Bar chart: marginal gains of documents a, b, c, d, e]
Observation: by submodularity, for every document d,
δi(d) ≥ δj(d) for i < j, since Ai ⊆ Aj
Marginal benefits δi(d) only shrink as i grows! Selecting document d at step i covers more words than selecting d at step j (j > i).
Idea: use δi as an upper bound on δj (j > i)
Lazy Greedy:
- Keep an ordered list of marginal benefits δi from the previous iteration
- Re-evaluate δi only for the top node
- Re-sort and prune
[Bar charts: marginal gains of documents a–e, re-sorted each iteration; A1 = {a}, then A2 = {a, b}]
Summary so far:
- Diversity can be formulated as set cover
- Set cover is a submodular optimization problem
- It can be (approximately) solved using the greedy algorithm
- Lazy greedy gives a significant speedup
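The lazy greedy idea can be sketched with a max-heap of stale marginal gains. This is a minimal illustration assuming F is word coverage over made-up documents (heapq is a min-heap, so gains are stored negated):

```python
import heapq

def lazy_greedy(items, F, k):
    chosen, covered_value = [], F([])
    # Initialize: marginal gain of each item w.r.t. the empty set.
    heap = [(-(F([x]) - covered_value), i) for i, x in enumerate(items)]
    heapq.heapify(heap)
    while len(chosen) < k and heap:
        neg_gain, i = heapq.heappop(heap)
        # Re-evaluate only the top element. By submodularity its old gain
        # is an upper bound, so if the fresh gain still beats the
        # runner-up's (stale) gain, it is the true argmax.
        fresh = F(chosen + [items[i]]) - covered_value
        if not heap or fresh >= -heap[0][0]:
            chosen.append(items[i])
            covered_value += fresh
        else:
            heapq.heappush(heap, (-fresh, i))  # re-insert with updated gain
    return chosen

F = lambda A: len(set().union(*A)) if A else 0
docs = [{"a", "b"}, {"b", "c", "d"}, {"d", "e"}]
print(F(lazy_greedy(docs, F, 2)))  # 4: picks {b,c,d} then {d,e}
```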
[Plot: running time (lower is better) vs. number of blogs selected, 1–10; the lazy variant is far faster]
The same coverage model also applies to recommendations.
[Word cloud with headlines covered: "France intervenes", "Chuck for Defense", "Argo wins big"]
Personalized coverage: the same word cloud (France, Mali, Hagel, Pentagon, Obama, Romney, Zero Dark Thirty, Argo, NFL) carries different weights for different users:
[Word cloud as weighted by a politico]
[Word cloud as weighted by a movie buff]
Two questions:
- I have 10 minutes. Which blogs should I read to be most up to date?
- Who are the most influential bloggers?
Given: for each information cascade i, we know the time T(i, b) when cascade i hits blog b.
Objective: maximize the expected reward, max_S Σi P(i) · Fi(S), where Fi(S) is the reward for detecting cascade i.
Reward: maximize the number of blogs that hear the information after we do.
If S is the set of blogs we read, F(S) denotes the number of blogs that we beat. The reward is submodular.
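A minimal sketch of the reward F(S) just defined, on made-up hit times (the blog names and times below are hypothetical):

```python
# hit_time[i][b]: time at which cascade i reaches blog b (made up).
hit_time = {
    "cascade1": {"a": 0, "b": 2, "c": 3, "d": 5},
    "cascade2": {"a": 4, "b": 1, "c": 2, "d": 3},
}

def beaten(S):
    """F(S): total number of blogs hit strictly after some blog in S."""
    total = 0
    for times in hit_time.values():
        hits = [t for b, t in times.items() if b in S]
        if hits:
            ours = min(hits)  # earliest time a blog we read is hit
            total += sum(1 for b, t in times.items() if b not in S and t > ours)
    return total

print(beaten({"a"}))  # 3: beats b, c, d on cascade1, nobody on cascade2
print(beaten({"b"}))  # 5: blog b detects both cascades early
```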
[Figure: cascade i spreading over blogs; reading the blue blog beats more blogs than reading the green blog]
Consider the following algorithm for the information cascade detection problem: greedy that ignores cost.
- Ignore blog cost
- Repeatedly select the blog with the highest marginal gain
- Stop when the budget is exhausted
Bad example:
n blogs, budget B. One blog has reward r and cost B; the others have reward r − ε and small cost. Greedy always prefers the expensive blog with reward r (and exhausts the budget); it never selects the cheaper blogs with reward r − ε. With variable costs, greedy can fail arbitrarily badly!
Instead, greedily pick the blog that maximizes the benefit-to-cost ratio: start from A0 = {} and at step i select
si = argmax_s [F(Ai−1 ∪ {s}) − F(Ai−1)] / c(s)
But the benefit-cost ratio can also fail. 2 blogs a and b:
Costs: c(a) = ε, c(b) = B. Only 1 information cascade: F({a}) = 2ε, F({b}) = B.
The ratio prefers a (2 vs. 1), so we first select a, but then we cannot afford b. We get reward 2ε instead of B. Now send ε → 0: the solution can be arbitrarily bad!
How far is CELF from the (unknown) optimal solution?
Theorem: CELF is near optimal. CELF achieves a ½(1 − 1/e) factor approximation!
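The CELF selection idea can be sketched as: run plain greedy (ignore cost) and benefit-cost greedy under the budget, then keep whichever solution scores higher. The items, costs, and coverage function below are made up; this is an illustration of the idea, not the paper's implementation:

```python
def budgeted_greedy(items, F, cost, budget, by_ratio):
    """Greedy under a budget, ranking by gain or by gain/cost ratio."""
    S, spent = [], 0
    remaining = list(items)
    while True:
        afford = [x for x in remaining if spent + cost[x] <= budget]
        if not afford:
            return S
        if by_ratio:
            key = lambda x: (F(S + [x]) - F(S)) / cost[x]
        else:
            key = lambda x: F(S + [x]) - F(S)
        best = max(afford, key=key)
        S.append(best)
        spent += cost[best]
        remaining.remove(best)

def celf_pick(items, F, cost, budget):
    """Return the better of the two greedy solutions."""
    s1 = budgeted_greedy(items, F, cost, budget, by_ratio=False)
    s2 = budgeted_greedy(items, F, cost, budget, by_ratio=True)
    return max(s1, s2, key=F)

words = {"x": {"w1"}, "y": {"w2", "w3"}, "z": {"w1", "w4"}}
cost = {"x": 1, "y": 2, "z": 3}
F = lambda S: len(set().union(*(words[i] for i in S))) if S else 0
best = celf_pick(list(words), F, cost, budget=3)
print(F(best))  # 3 words covered within the budget
```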
Back to the solution quality! The (1-1/e) bound for submodular functions is the worst case bound (worst over all possible inputs)
Can we say something about the solution quality when we know the input data?
Let OPT = {o1, …, ok} be the optimal solution and A our solution. For each element b let δ(b) = F(A ∪ {b}) − F(A). Order the elements so that δ(b1) ≥ δ(b2) ≥ … Then:
F(OPT) ≤ F(A) + Σi=1..k δ(bi)
Note:
- This is a data-dependent bound (the δ(bi) depend on the input data)
- The bound holds for any algorithm
- For some inputs it can be very loose (worse than 63%)
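The data-dependent bound above can be computed directly after running any algorithm. A minimal sketch (the sets and the solution A are made up):

```python
def online_bound(A, candidates, F, k):
    """F(A) plus the k largest marginal gains bounds F(OPT) from above."""
    gains = sorted((F(A + [x]) - F(A) for x in candidates if x not in A),
                   reverse=True)
    return F(A) + sum(gains[:k])

sets = [{1, 2}, {2, 3}, {4}, {5, 6}]
F = lambda S: len(set().union(*S)) if S else 0
A = [sets[0]]                         # pretend some algorithm picked {1,2}
print(online_bound(A, sets, F, k=2))  # 5: no 2-set solution can beat this
```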
Heuristics perform much worse! One really needs to perform the optimization