
Building a Relevance Engine with the Interest Graph

140 Proof Research


Kumar Dandapani, John Manoogian III

Introduction

The proliferation of electronic social networks since the late 2000s has generated an abundance of data on individuals and their interests. The volume and cacophony of this data create significant challenges in characterizing social media users and identifying relevant information from the social stream. Although relevance is often a nebulous, contextual concept, it can safely be defined as occurring when content engages the interests of a user. Consequently, arriving at relevant content motivates the creation of a systematic and falsifiable mechanism for identifying the interests of network users. At 140 Proof, this effort has come to be part of a broader development effort called the Relevance Engine.

A commonly used approach for building interest groups is to cluster users with similar characteristics, often by their usage of keywords, but these approaches often preclude falsifiability and rely heavily on intense computation as a substitute for acquiring subject-area expertise on social media interests. While such techniques are capable of maximizing a well-defined objective function, such as the number of Facebook Likes over a period of time, they can lead to conclusions that suggest relationships between variables that are nonsensical or prove to be unstable over time. The Relevance Engine represents a departure from such prevailing techniques by providing a framework for acquiring subject-area expertise from experiments on social media engagement and then combining this information with components of network graph theory. This approach allows us to understand the underlying data-generating processes and the statistical properties of the explanatory variables that expose a user's interests. This paper describes the empirical and theoretical foundations of our approach.

1. Social Networks

Social media networks share many common components. This section describes the participants in these networks, their motivations and behaviors, and how one can use these qualities to obtain relevant information from the interest graph.

1.1 Users

Participants in social media vary widely in their usage of the network and their propensity for engagement. Some individuals focus more heavily on the creation of content, while others use social media primarily as a content consumption mechanism. Our approach to identifying interests accounts for these different uses by relying less on user-generated content and more heavily on observed behavioral responses to social stream content.
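The behavior-first representation described above can be illustrated with a minimal sketch. The event-log structure and field names here are hypothetical, not 140 Proof's implementation; the point is simply that users are summarized by how they respond to impressions rather than by the keywords they post.

```python
# Hypothetical sketch: summarize users by observed engagement behavior.
from collections import defaultdict

def engagement_rates(events):
    """events: iterable of (user_id, engaged) pairs, where engaged is
    True if the user observably engaged with an impression."""
    shown = defaultdict(int)
    engaged = defaultdict(int)
    for user_id, did_engage in events:
        shown[user_id] += 1
        engaged[user_id] += int(did_engage)
    return {u: engaged[u] / shown[u] for u in shown}

events = [("u1", True), ("u1", False), ("u1", False),
          ("u2", False), ("u2", False)]
rates = engagement_rates(events)
```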

1.2 Interests

For the purposes of our investigation, interests are defined as a collection of subjects, activities, or attitudes that are capable of reliably drawing the attention of a subset of the population. Designing content that engages a group of users who match an interest is a challenging, but critical, input to the Relevance Engine. If it proves to be impossible to generate social media content that engages some subset of the population disproportionately, then it is likely that the interest is either too broadly defined (e.g. right-handed people) or it represents a concept that is so abstract (e.g. identifying idealists) that it can't be captured with social media content. As a result, designing social stream content for a given interest draws heavily on subject-area expertise in social media.

1.3 Engagement

Social networks have several mechanisms to observe a user's response to specific content in the social stream. Some of the most common examples include the Like button on Facebook, the Favorite or Retweet options on Twitter, and the option to +1 on Google+. The development of the Relevance Engine does not rely exclusively on a specific engagement mechanism, but requires that interest-specific or interest-neutral content be presented in a manner that allows one to unambiguously engage exclusively with the stimulating content.

The Relevance Engine draws upon the concept of interest-neutral content for determining a baseline engagement rate for a given network user. By design, certain types of content should not garner a disproportionate amount of attention and should not produce statistically significant differences in engagement rates when compared to the overall network. An example of this broad, general content would be a non-partisan news article. While all content has the potential to be unwittingly interest-specific, if the engagement rate for the social media content is within the margins of random variation centered around the benchmark rate of engagement for the network, it can effectively be described as interest-neutral. The benefit of having such content is that it provides a control for the varying levels of engagement that are inherent to participants in social media. Users will vary widely in their baseline propensity to observably engage with any type of social media content, and a failure to recognize and correctly quantify this engagement propensity can create a power confound that results in Type I and Type II erroneous inferences. By controlling for these biases, the Relevance Engine can allow us to evaluate whether a user matches a given interest regardless of their baseline engagement propensity.

An additional benefit of interest-neutral content is that it lets us measure whether interest-specific content in our inventory is anomalous. Content that is universally more engaging for reasons beyond its intent (e.g. an objective news story with a provocative title) will likely observe a higher engagement rate for reasons other than the fact that it resonates with users that share a specific interest. Such content can be readily handled by comparing the statistical properties of each piece of content with the broader network.

1.4 Content

Social networks offer multiple channels in which a user is presented with content. Tweets, wall posts, and links are some common examples of these content distribution mechanisms. Many of these channels offer a way in which both interest-specific and interest-neutral content can be delivered.

2. Defining Interests by Engagement

Identifying the interests of users by their engagement behavior is achieved through inferential statistical procedures in which each piece of social media content is treated as a binomial random variable and the user's decision to observably engage with content is coded as a success. This section describes our approach to designing interest identification experiments and the application of binomial sequential tests as a way of quantifying a
user's propensity to engage with interest-specific content beyond their baseline rate of engagement.

2.1 Experimental Design

Group sequential multiple sampling procedures provide a compelling framework for inferring the interests of a given user. Sequential sampling is an alternative to fixed sample-size tests, which are expensive in terms of both media inventory and the amount of time consumed before statistically valid conclusions can be drawn. In the context of social networks, sequential sampling is particularly compelling given that the number of sessions per user is not known beforehand and is subject to a high degree of variability due to unpredictable group sizes.

To formalize this model, we start by defining I as a vector of unique interests that are believed to be capable of being measured. The objective of our sequential trials is to map these interests onto the user space U. Using this notation, I_i (i = 1, 2, \ldots) represents the ith interest under investigation during a given trial. Trials are conducted on a per-user basis, so U_{u,I_i} (u = 1, 2, \ldots) is a dichotomous variable indicating if the uth user on the network shares an interest in I_i.

Each interest is represented by an inventory of social media content that is denoted by the matrix A, where A_j (j = 1, 2, \ldots) represents the jth piece of content for interest I_i. To make this problem more tenable, it is assumed that each piece of content A_j is contextually independent of A_{1 \ldots N}, where a total of N pieces of content have been constructed for I_i, and that the order of the content does not affect the likelihood of engagement. Given N unique interest-specific pieces of content in set A for I_i, the distribution of engagement per user can be described from a series of independent Bernoulli trials where \theta_{I_i} is the sample proportion of interest-specific social stream content with which the user observably engaged (i.e. success). From this, we can describe a given user's engagement behavior with respect to I_i using a binomial distribution.

\theta_{I_i} = \frac{1}{N_A} \sum_{j=1}^{N_A} A_j    (1)

U_{u,I_i} \sim B(N_A, \theta_{I_i})    (2)

From the binomial distribution, the mean and variance of the proportion of engagements for interest-specific content are

E[U_{u,I_i}] = N_A \theta_{I_i}    (3)

V[U_{u,I_i}] = N_A \theta_{I_i} (1 - \theta_{I_i})    (4)

Similarly, we define C to be a matrix with cardinality N_C to represent the interest-neutral control group of the experiment that is presented to all network users. It is again assumed that each piece of content is independent and can be presented in random order to the user without introducing any pronounced confounds.

\theta_C = \frac{1}{N_C} \sum_{j=1}^{N_C} C_j    (5)

U_{u,C} \sim B(N_C, \theta_C)    (6)

The binomially distributed random variable U_{u,C} describes our estimates of the baseline social media engagement rate for user u. Based on asymptotic properties, the standard error of our sample-estimated user social media engagement rate, E[U_{u,C}], will decrease as the user is exposed to more content from the control inventory.

E[U_{u,C}] = N_C \theta_C    (7)

V[U_{u,C}] = N_C \theta_C (1 - \theta_C)    (8)

The sequence of binomial trials for a given user, where i \in (1, N) and items are chosen by alternating assignment with random sampling without replacement, is as follows:

C_j, I_i, C_{j+1}, I_{i+1}, \ldots, C_{j=N_C}, I_{i=N_A}    (9)
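The alternating trial sequence above can be illustrated with a small simulation. This is a sketch rather than the production system: the true rates theta_i and theta_c below are assumed values chosen for the example, and engagement is drawn as independent Bernoulli trials per the independence assumptions stated earlier.

```python
# Illustrative simulation of the alternating C_j, I_i, C_{j+1}, I_{i+1}, ...
# trial sequence, with each impression an independent Bernoulli draw.
import random

def run_alternating_trials(theta_i, theta_c, n_pairs, rng):
    """Return (interest_engagements, control_engagements) after n_pairs
    alternating control/interest impressions."""
    x_i = x_c = 0
    for _ in range(n_pairs):
        x_c += rng.random() < theta_c   # control impression C_j
        x_i += rng.random() < theta_i   # interest impression I_i
    return x_i, x_c

rng = random.Random(42)
x_i, x_c = run_alternating_trials(theta_i=0.08, theta_c=0.02,
                                  n_pairs=5000, rng=rng)
theta_i_hat = x_i / 5000   # sample estimate of theta_{I_i}
theta_c_hat = x_c / 5000   # sample estimate of theta_C
```

With enough impressions, the sample proportions recover the assumed rates, which is what allows the sequential test to separate interest-specific engagement from the baseline.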

Under the assumptions of independence described by Meeker (1981), the nominal expected

engagements x_{I_i} and x_C can be obtained from the formula

P(x_{I_i}, x_C; N_A, N_C, \theta_{I_i}, \theta_C) = \binom{N_A}{x_{I_i}} \theta_{I_i}^{x_{I_i}} (1 - \theta_{I_i})^{N_A - x_{I_i}} \binom{N_C}{x_C} \theta_C^{x_C} (1 - \theta_C)^{N_C - x_C}    (10)

2.2 Analyzing Sequential Trial Experiments

Sequential binary testing allows us to accept or reject the null hypothesis that, for a given user u, the engagement rate for interest-specific content, E[U_{u,I_i}], cannot be distinguished from the user's baseline engagement rate E[U_{u,C}]. The two-sided hypothesis being tested is H_0: Pr(U_{u,I_i}) = Pr(U_{u,C}) vs. H_A: Pr(U_{u,I_i}) \neq Pr(U_{u,C}). Extending the formulation described by Jennison and Turnbull (1993), we let \delta = \theta_{I_i} - \theta_C represent the parameter indicating that the proportions are not equivalent. We start by defining a marginal distribution of W_k for the case where the user receives an equal number of control and interest-specific items, n_k. Under this condition, the parameter estimate can be approximated from the normal distribution as follows:

W_k = \sum_{i=1}^{n_k} X_{I_i} - \sum_{i=1}^{n_k} X_{C_i} \sim N(n_k \delta, 2 n_k \sigma^2)    (11)

W_k \sim N(n_k \delta, n_k \{\theta_{I_i}(1 - \theta_{I_i}) + \theta_C (1 - \theta_C)\})    (12)

We start with the null hypothesis that rates of engagement are identical for treatment and control:

W_k \sim N(0, 2 n_k \sigma_0^2), \quad k = 1, \ldots, K    (13)

and we can accept or reject a one- or two-sided null hypothesis as follows:

W_k > c_k(\alpha) \sqrt{2 n_k \sigma^2}, \quad k = 1, \ldots, K    (14)

W_k < -c_k(\alpha) \sqrt{2 n_k \sigma^2}, \quad k = 1, \ldots, K    (15)

The complete framework for sequential tests then becomes

Pr\{|W_1| < \sqrt{2 n_1 \sigma_0^2}\, c_1(\alpha), \ldots, |W_{k-1}| < \sqrt{2 n_{k-1} \sigma_0^2}\, c_{k-1}(\alpha), |W_k| \geq \sqrt{2 n_k \sigma_0^2}\, c_k(\alpha) \mid W_j \sim N(0, 2 n_j \sigma_0^2), j = 1, \ldots, K\} = \frac{\alpha}{2} \left( \frac{n_k}{n_{max}} - \frac{n_{k-1}}{n_{max}} \right), \quad k = 1, \ldots, K    (16)

Given a predetermined Type II error power estimate of \beta, critical values c_k(\alpha), k = 1, \ldots, K, and a targeted effect size \bar{\delta}, under the assumption of normality the conditions for stopping become:

Pr\{-\Phi^{-1}(1 - \alpha/2) \sqrt{2 n_k \sigma^2} < W_k < \Phi^{-1}(1 - \alpha/2) \sqrt{2 n_k \sigma^2} \mid \delta = \bar{\delta}\} = \beta    (17)

Statistically valid inferences can then be drawn from this test, with the minimum sample size n_k calculated as follows:

n_k = \frac{2 \sigma^2}{\bar{\delta}^2} \left[ \Phi^{-1}(1 - \alpha/2) + \Phi^{-1}(1 - \beta) \right]^2    (18)

One complication in the analysis of social media content is that the proportion of successes can be quite low even when presenting the most engaging content available. Low proportions make it challenging to arrive at unbiased parameter estimates, and thus difficult to draw valid inferences under the assumption of normality. To address this issue, we explore the calculation of bias-adjusted maximum-likelihood estimates as described in Brown (2002). Alternative binomial proportion confidence intervals for the engagement rate can also be obtained using the Wald interval approach, RCIs, and the Agresti-Coull approach.

2.3 Content Quality Control

In our sequential testing framework, users consistently receive interest-neutral content and over
time the standard errors around the parameter
estimate of the expected proportion of successful
engagements with interest-neutral content should

decline while still providing a framework to account for the changing engagement behavior of users on the network. With this approach, the control content provides a quality control check and a mechanism to monitor a user's changing interests. If a given piece of interest-neutral content is generally less compelling than interest-specific content, then users are likely to be erroneously classified as harboring a particular interest, leading to false discovery. To control for this effect, we continually assess whether a particular piece of content is universally more or less compelling by looking at the distributional properties of each piece of interest-neutral social media content across the network. If the engagement rate is abnormally high or low relative to the network response rate, then we exclude it from interest identification. This helps to ensure that interest-neutral content is truly interest-neutral. As more sequential trials are conducted, such techniques for quality control will see improvement.
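The screening step described above can be sketched as follows. This is a simplified stand-in for the paper's MLE-based acceptance range: it flags content whose network-wide engagement rate falls outside a normal-approximation band around the network benchmark. The 1.96 critical value (two-sided, alpha = 0.05) and the toy counts are assumed choices for the example.

```python
# Sketch: exclude interest-neutral content whose network engagement rate
# is anomalously high or low relative to the network benchmark rate.
import math

def flag_anomalous(content_stats, benchmark_rate, z_crit=1.96):
    """content_stats: dict of content_id -> (engagements, impressions).
    Returns the set of content ids to exclude from interest identification."""
    flagged = set()
    for cid, (x, n) in content_stats.items():
        se = math.sqrt(benchmark_rate * (1 - benchmark_rate) / n)
        z = (x / n - benchmark_rate) / se
        if abs(z) > z_crit:
            flagged.add(cid)
    return flagged

stats = {"c1": (55, 1000), "c2": (21, 1000), "c3": (19, 1000)}
excluded = flag_anomalous(stats, benchmark_rate=0.02)
```

Here "c1" engages at 5.5% against a 2% benchmark and is excluded, while "c2" and "c3" sit within random variation of the benchmark and are retained as controls.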

For j \in (1 \ldots J), we define the total number of engagements per interest-specific item as S = \sum_{u=1}^{U} A_j. From MLE, we arrive at the lower p_L and upper p_U bound confidence intervals, (p_L, p_U) = (p_L(S), p_U(S)); the acceptance range characterizing a piece of content as being within the range of normal engagement is then defined as

Pr[S \geq s \mid p = p_U(S)] = Pr[S \leq s \mid p = p_L(S)] = \alpha/2    (19)

2.4 Early Stopping Rules for Sequential Trials

The rules for stopping a given trial are based on the likelihood ratio. We define \delta_i to be the log odds ratio (sequential probability ratio) that measures the difference in engagement rate between the content specific to interest I_i and interest-neutral content C. By running a two-sided hypothesis test, we can stop showing interest-specific impressions to a user when it becomes clear that he/she has no propensity for those impressions. Interest-specific content is suspended once the condition is met. The simplest approach to stopping trials early is based on the log-odds ratio of the treatment to the control against a predetermined threshold:

\delta_i = \log \left[ \frac{\theta_{I_i} (1 - \theta_C)}{\theta_C (1 - \theta_{I_i})} \right]    (20)

The Relevance Engine also uses the Wald interval. Upon observing a \hat{p} that is outside of the long-memory bounds relative to our defined Type I significance level, we suspend trials for that interest for the user.

\hat{p} \pm \kappa n^{-1/2} (\hat{p}(1 - \hat{p}))^{1/2}, \quad \kappa = \Phi^{-1}(1 - \alpha/2)    (21)

2.5 Future Research

The following topics represent areas of future research in both developing and analyzing sequential trials on interest-based engagement.

(1) Time-series analysis: Future models will attempt to address the time-series component of interests. One such model would propose a weighting scheme to emphasize more recent engagement behavior. The existing approach presumes that interests are stationary, creating a need to periodically run trials on the user to measure changing interests over time. One such example is a user that is a new parent engaging with baby-related social media content when previously that user did not align with such interests.

(2) Assignment mechanisms: By acknowledging the possibility that the response rate at t is potentially influenced by the response rate at t-1, we can compare alternating assignment, random assignment, and adaptive sequencing assignments. A major assumption behind this experimental design is that the decision to engage with the control is independent of the treatment and the decision to engage with one piece of content does not preclude the ability to engage with another piece of content. As long as treatment and control do not compete with one another for

the user's attention, this independence can generally be assumed, but by controlling for the order in which the content is received we can address concerns about the impact on engagement behavior of fatigue from being shown irrelevant content.

(3) Dropouts: Given the potential for users to enter and exit a social network, future research will attempt to generate estimates that account for users that abandon social media or that only receive a small number of treatments.

(4) Principal component analysis: There are several endogenous qualities associated with users that have a compelling causal relationship with engagement. Such factors include the number of network associations, days since the user joined the network, gender, and location. Research in this area would also help us account for exogenous factors such as major news events and announcements and their confounding impact on the network. The presence of such events might result in users being disinclined to engage with either interest-neutral or interest-specific content for a large window of time.

3. Defining Interests by Network Associations

Exclusively relying on an engagement-based interest graph fails to address the sizable proportion of social media users that never observably engage with any type of content. As a result, the Relevance Engine uses observed engagement data as a seed to identify which network associations are suggestive of a user-interest match. The process within the Relevance Engine is to identify the first-, second-, and third-degree associations and the Simmelian ties between users that have been classified as sharing a particular interest and all other users on the network. With this association matrix, we run logit regressions to see which of those associations are statistically significant. The explanatory power of certain network associations on interest alignment is then tested in a statistical framework, and the forecasting power of these models is tested out-of-sample.

3.1 Modeling Network Associations

We start this process by identifying the subset of users that engaged with interest-specific content at a rate that exceeded their baseline engagement rate and create an n×m matrix Z that represents user relationships per interest I_i. In this matrix, we use h to represent the user that engaged with interest-specific content and r to represent the potentially related user. Using this notation, Z_{hr} = X indicates the number of degrees in that connection, where X \in (0, 1, 2, 3) corresponds to the degrees of separation of the network association.

3.1.1 Simmelian Tie Measures

As described in Krackhardt (1999), Simmelian ties are a way of capturing strong associations in the social graph by observing reciprocity in associations. Simmelian ties are defined as a variant of symmetric relationships in the network, where Z'Z represents all of the mutual relationships between users h and r. From this foundation, we can characterize the subset of Simmelian ties as S = Y \otimes (Y^2). These associations are then codified with a dichotomous variable in our logit framework.
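A small sketch of the Simmelian tie construction may help. Following Krackhardt's definition, a reciprocal tie counts as Simmelian when both members also share a reciprocal tie with at least one common third party; the directed adjacency matrix Y below is an invented toy graph, not data from the paper.

```python
# Sketch: detect Simmelian ties in a binary directed adjacency matrix.
# Y[h][r] = 1 if user h nominates user r. A mutual (reciprocated) tie is
# Simmelian when it is embedded in a triad of mutual ties, mirroring the
# S = Y (x) (Y^2) construction described above.
def simmelian_ties(Y):
    n = len(Y)
    mutual = [[Y[i][j] and Y[j][i] for j in range(n)] for i in range(n)]
    S = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and mutual[i][j]:
                # shared reciprocal tie with a common third party?
                if any(mutual[i][k] and mutual[j][k]
                       for k in range(n) if k not in (i, j)):
                    S[i][j] = 1
    return S

# Triangle of mutual ties among users 0, 1, 2; user 3 has only an
# isolated reciprocal dyad with user 0.
Y = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
S = simmelian_ties(Y)
```

The 0-3 dyad is reciprocal but not Simmelian, since the pair shares no common third party, whereas every tie inside the 0-1-2 triangle is.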

3.2 Logit Regressions

Logit regressions serve as the main framework by which we identify which nodes on the network have explanatory power on interest-specific engagement. Following our notation on the expected engagements on a given interest by user u, we specify the following logit model and use maximum-likelihood estimation to arrive at the model parameters.

logit(E[U_{u,I_i} \mid Z_{u,1} \ldots Z_{u,m}]) = \beta_0 + \beta_1 Z_{u,1} + \cdots + \beta_m Z_{u,m}    (22)

The logit model is then reduced to the subset of values where the ML-estimated t-values t_i = \hat{\beta}_i / \hat{\sigma}_{\beta_i} exceed a pre-specified critical value.
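A minimal, self-contained fit of a one-predictor version of Eq. (22) can be sketched with plain gradient ascent on the log-likelihood; a production system would use a statistics library and report standard errors for the t-value screen. The toy data below, where users with the network association (Z = 1) engage far more often, is invented for illustration.

```python
# Sketch: maximum-likelihood logit fit with one network-association
# predictor, via gradient ascent (toy data, assumed for the example).
import math

def fit_logit(z, y, lr=0.5, steps=5000):
    """Fit logit(E[y|z]) = b0 + b1*z by gradient ascent on the
    log-likelihood; returns (b0, b1)."""
    b0 = b1 = 0.0
    n = len(y)
    for _ in range(steps):
        g0 = g1 = 0.0
        for zi, yi in zip(z, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * zi)))
            g0 += yi - p            # score for the intercept
            g1 += (yi - p) * zi     # score for the association term
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# 10 users with the association (8 engaged), 10 without (2 engaged)
z = [1] * 10 + [0] * 10
y = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8
b0, b1 = fit_logit(z, y)
```

For this data the MLE is b0 = log(0.2/0.8) and b0 + b1 = log(0.8/0.2), so a large positive b1 indicates the association has explanatory power on engagement.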

3.2.1 Heckman-style Model

The use of a Heckman-style model is motivated by the desire to correct for the inherent self-selection bias associated with interest-based engagement, which results in a non-random sample. This non-random sample has the potential to create a skewed social graph for a given interest by failing to account for users that never observably engage. The basic Heckman procedure to correct for this problem is described by Sartori (2003):

y_i = x_i' \beta + \epsilon_i    (23)

U_i = w_i' \gamma + u_i    (24)

This is in many ways similar to an omitted-variable misspecification problem in an OLS regression, where \phi is the standard normal density and \Phi is the standard normal cumulative distribution:

E(y_i) = x_i' \beta + \left[ \frac{\phi(\gamma' w_i)}{\Phi(\gamma' w_i)} \right]    (25)

3.2.2 Equivalence Relations

The ubiquity of certain network associations (e.g. celebrities) can pose challenges in a regression framework. An approach to mitigating the effects of multicollinearity in our network node regressions is to identify equivalence relations between each of the nodes on the network by performing a Turing reduction on the matrix Z for each interest.

4. Model Validation

Falsifiability is a key feature of the Relevance Engine, and this section outlines the ways in which we can validate the success of our classification mechanism. Two main types of validation need to be performed: (1) assess the stability of content-based engagement, as determined by the ability to replicate the sequential trial experiments on a fresh subset of the network; and (2) perform an out-of-sample validation of our second-stage social graph user-interest classification on a second set of users.

4.1 Methodology

To validate our interest graph inferences, the population of network users is divided randomly into two groups, A and B, in which we restrict the samples to users that have participated in the social network for a comparable period of time to control for adoption differences. After defining a set of N reasonable interests, the engagement-based sequential tests are performed on group A. The network associations of the users in group A are defined for each interest, and a Z_i matrix is created for each interest i = 1 \ldots N. We apply the logit model fit on group A to group B to arrive at expected probabilities of engagement Y_{P2}. Interest-specific content is then shown to users in group B, and the forecasted engagement rate is compared to the observed engagement rate.

4.2 Evaluation

The objective of this evaluation is to see if the Relevance Engine produces results in which the users in the out-of-sample group that have been classified as sharing an interest have a higher engagement rate for interest-specific content than for interest-neutral content. We evaluate our performance by comparing the root-mean-squared error of the out-of-sample data with the in-sample data, where Y_{P1} corresponds to estimated engagement based on the in-sample logit estimate and Y_{P2} corresponds to the out-of-sample observed value.

RMSD(Y_{P1}, Y_{P2}) = \sqrt{E[(Y_{P1} - Y_{P2})^2]}    (26)
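The out-of-sample check in Eq. (26) reduces to a few lines. The predicted and observed rates below are invented placeholders for the worked example; in practice they would be the group-A logit forecasts and the group-B observed engagement rates per interest.

```python
# Sketch: RMSD between forecast engagement rates (in-sample logit fit)
# and observed out-of-sample engagement rates, per Eq. (26).
import math

def rmsd(predicted, observed):
    assert len(predicted) == len(observed)
    return math.sqrt(sum((p - o) ** 2
                         for p, o in zip(predicted, observed)) / len(predicted))

y_p1 = [0.08, 0.05, 0.12, 0.02]   # assumed in-sample logit estimates (group A)
y_p2 = [0.07, 0.06, 0.10, 0.02]   # assumed observed rates (group B)
error = rmsd(y_p1, y_p2)
```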

References

[1] C. Jennison and B.W. Turnbull, "Group Sequential Tests and Repeated Confidence Intervals, with Applications to Normal and Binary Responses," Biometrics, Vol. 49, 1993, pp. 31-43.

[2] W.Q. Meeker, "A Conditional Sequential Test for the Equality of Two Binomial Proportions," Journal of the Royal Statistical Society, Series C, Vol. 30, 1981, pp. 109-115.

[3] A.E. Sartori, "An Estimator for Some Binary-Outcome Selection Models Without Exclusion Restrictions," 2003.

[4] W. Lehmacher and G. Wassmer, "Adaptive Sample Size Calculations in Group Sequential Trials," Biometrics, Vol. 55, 1999, pp. 1286-1290.

[5] R. Simon, G.H. Weiss, and D.G. Hoel, "Sequential Analysis of Binomial Clinical Trials," Biometrika, Vol. 62, 1975, pp. 195-200.

[6] D.A. Schoenfeld, "A Simple Algorithm for Designing Group Sequential Trials," Biometrics, Vol. 57, 2001, pp. 972-974.

[7] L.D. Brown, T.T. Cai, and A. DasGupta, "Confidence Intervals for a Binomial Proportion and Asymptotic Expansions," The Annals of Statistics, Vol. 30, No. 1, 2002, pp. 160-201.

[8] D. Krackhardt, "Structure, Culture and Simmelian Ties in Entrepreneurial Firms," Social Networks, Vol. 24, No. 3, 2002, pp. 279-290.
