You are on page 1of 44

1

DETERMINING INFLUENTIAL USERS IN INTERNET SOCIAL NETWORKS

Michael Trusov Anand V. Bodapati Randolph E. Bucklin*

April 20, 2009

* Michael Trusov is Assistant Professor, Robert H. Smith School of Business, University of Maryland, College Park, MD 20742, phone: (301) 405-5878, fax: (301) 405-0146, e-mail: mtrusov@rhsmith.umd.edu, Anand Bodapati is Associate Professor, UCLA Anderson School, 110 Westwood Plaza, Los Angeles, CA 90095, phone: (310) 206-8624, fax: (310) 206-7422, e-mail: anand.bodapati@anderson.ucla.edu and Randolph E. Bucklin is Peter W. Mullin Professor, UCLA Anderson School, 110 Westwood Plaza, Los Angeles, CA 90095, phone: (310) 8257339, fax: (310) 206-7422, e-mail: rbucklin@anderson.ucla.edu.

Electronic copy available at: http://ssrn.com/abstract=1479689

2

DETERMINING INFLUENTIAL USERS IN INTERNET SOCIAL NETWORKS

ABSTRACT The success of Internet social networking sites depends on the number and activity levels of their user members. While users typically have numerous connections to other site members (i.e., “friends”), only a fraction of those “friends” may actually influence a member’s site usage. Since the influence of potentially hundreds of friends needs to be evaluated for each user, inferring precisely who is influential

– and therefore of managerial interest for advertising targeting and retention efforts – is difficult. We
develop an approach to determine which users have significant effects on the activities of others using the longitudinal records of members’ login activity. We propose a non-standard form of Bayesian shrinkage implemented in a Poisson regression. Instead of shrinking across panelists, strength is pooled across variables within the model for each user. The approach identifies the specific users who most influence others’ activity and does so considerably better than simpler alternatives. For the social networking site data, we find that, on average, about one-fifth of a user's friends actually influence his/her activity level on the site. Keywords: Internet, Social Networking, Bayesian Methods

Electronic copy available at: http://ssrn.com/abstract=1479689

3

DETERMINING INFLUENTIAL USERS IN INTERNET SOCIAL NETWORKS

INTRODUCTION In 1995, when the first notable social networking (SN) website, Classmates.com, was launched, few might have guessed that, 13 years later, SN sites would have tens of millions of users and would be valued at billions of dollars. SN sites currently attract more than 90% of all teenagers and young adults in the U.S. and have a market of about 80 million members. The cover of Business Week magazine’s issue dated December 12, 2005 suggested that the next generation of Americans could be called the MySpace generation. Emphasizing SN’s importance to marketing, the Marketing Science Institute designated “The Connected Customer” as its top research priority (Marketing Science Institute 2006). The core of an SN site is a collection of user profiles (Figure 1) where registered members can place information that they want to share with others. For the most part, users are involved in two kinds of activities on the site: they either create new content by editing their profiles (adding pictures, uploading music, writing blogs and messages, etc.), or they consume content created by others (looking at pictures, downloading music, reading blogs and messages, etc.). On most SN sites, users can add other users to their networks of “friends.” Usually, one user initiates the invitation and the other user accepts or rejects it. When accepted, the two profiles become linked. The most popular SN site business model is based on advertising. As users surf through a site, advertisements are displayed on the web pages delivered to the users. SN firms earn revenue from either showing ads to site visitors (impressions) or being paid for each click/action taken by site visitors in response to an ad. Consequently, user involvement with a site (e.g., time spent on the site, number of pages viewed, amount of personal information revealed) directly translates to firm revenue. SN firms have commonly used members’ profile information for ad targeting purposes.

4

The Role of Users in User-Generated Content Environments At social networking sites the content is almost entirely user-generated. To attract traffic, an SN firm itself cannot do much beyond periodic updates of site features and design elements. The bulk of digital content – the driving force of the site’s vitality and attractiveness – is produced by its users. Users, though, are not all created equal. Community members differ widely in terms of the frequency, volume, type, and quality of digital content generated and consumed. From a managerial perspective, understanding who keeps the SN site attractive – specifically, identifying those users who influence the site activity of others – is vital. Such understanding permits more precise ad targeting as well as retention efforts aimed at sustaining and/or increasing the activity of influential existing users (and therefore future ad revenue). The importance of identifying influential users on a social networking site has been recently highlighted by Google’s efforts to improve ad targeting at MySpace. Apparently disappointed with the returns from its recent deal to place advertising on MySpace, Google is developing algorithms to improve the identification of influential users (Green 2008). The idea is to target ads at site members who get the most attention, not simply those with certain characteristics in their profiles. Ad buyers have indicated that they would pay premium rates for this type of influence-based targeting. Google has filed a patent application, having apparently adapted its PageRank algorithm to the problem (Green 2008). The financial implications of improved ad targeting are significant because display ad pricing varies widely with audience attractiveness. For example, Walsh (2008) reports cost per thousand impressions varying from 5 cents to 80 cents or more. Our objective is to develop and test a methodology to identify influential users in online social networks based on a simple metric of their activity level. In this paper, we consider a user to be “influential” in a social network if his/her activity level, as captured by site logins over time, has a significant effect on others’ activity levels and, consequently, on the site’s overall page view volume. An example of this type of influence was given by Holmes (2006) who reports that when a popular blogger

while a lower number of logins implies lower usage. the site’s visitor tally fell and content produced by three invited substitute bloggers could not stem the decline.5 left his blogging site for a two-week vacation. we assume that during high-usage days. users’ online activities can be tracked and recorded. Empirical Inference of Site Usage Influence Most existing studies on social influence adopt a survey approach. Users log in to the site to consume new digital content produced by other users. Also. In this study. the last login date and time or time stamped content updates). users can infer how active a specific individual has been on the site and update their expectations for future activity. as both processes – content consumption and content creation – constitute site usage. for an online community with millions of members surveys are problematic. For example. Accordingly. the user has more opportunities to produce more content than during low usage days. If a member . in computer-mediated environments. From the traces left by others (e. if. the user observed an increase in volume and/or update frequency of new content. The model we develop infers site usage influence from secondary data on member login activity. Our approach is based on the following logic. treating the frequency of logins as a proxy for usage. We can use the data to ascertain the effect of any change in a user’s behavior (either increasing or decreasing usage) on the behavior of those linked to him or her. we tracked daily login activities of anonymous community members. User expectations are formed from recent past experiences.g. Logging in is more attractive when a user expects that there is likely to be new content to view. A higher number of logins per day is taken to be a sign of higher usage. In our dataset. he or she may choose to wait less time before logging in again. While questionnaires may work well for small groups. It can be easily extended to other online activity measures that are potentially available to SN site managers such as the amount of time spent on the web site and number of messages sent. for the past few login occasions. Fortunately.. which comes from a major social networking site. we use the number of logins per day as an effective correlate for these other measures. we propose that a member’s site usage level at each point is driven by his or her expectation about the volume and update frequency of relevant new content created by others.

we propose to use a different type of Bayesian shrinkage in which strength is pooled across variables within an individual rather than across individuals. price) for a given panelist is related to the effect that the same variable has on other panelists. Thus. in turn. Then. Note that this formalization of “influence” fits well with the firm’s business objectives. and. then we propose that this person is not influential. we seek to identify users whose behavior on the site has the most significant impact on the behavior of others in the network.6 increases his or her usage.. and it is the friends’ activity levels which make up the variable set. The balance of this paper is organized as follows. UPC scanner panel data is to use Bayesian shrinkage. according to studies in sociology. This approach is grounded on the assumption that the effect of a certain variable (e. we touch on two streams of related literature: . The problem of identifying influential users with site activity data turns out to be difficult because the data are typically sparse relative to the number of effects that need to be evaluated. is experienced offline. each user has a unique set of friends. potentially hundreds. our empirical results show significant heterogeneity among users on two dimensions: susceptibility to influence from other users and the extent of influence on others in the network. influences few people. With the data and this notion of influence. The average user is influenced by relatively few other network members. say. In the next section we provide a brief overview of the study context and introduce key terminology. Our findings indicate that social influence in an online community is similar to what. At a social networking site.g. simple approaches to effects estimation do not do well (as demonstrated later in this paper). A common way to address the sparse number of individual-level observations in. this type of Bayesian shrinkage cannot be used here because the variable set differs across users. As expected. Unfortunately. Also. having many “friends” (linked profiles) does not make users influential per se. and the people connected also increase their usage (due possibly to their interest in what this person is creating). To handle this. if a member’s usage goes up or down and usage does not change among the people connected to him or her. we propose that this identifies this person as influential. Managers need to know who stimulates activity levels among other members of the network. On the other hand. where strength is pooled across panelists.

John. In this sense Allison’s online activity influences their behavior. see Iacobucci (1996) and Van Den Bulte and Wuyts (2007). The study of relationships among interacting units is a cornerstone of social network analysis (SNA). Among these friends there are just a few people who actually make the site attractive to Allison. model formulation and empirical estimation on field data from a social networking site. Alberto. 1 .7 social influence and online communities. these methods are specifically geared toward an investigation of the relational aspects of these structures (Scott 1992). while tending to ignore updates in other connected profiles. undirected graph with binary edges. the importance of an individual actor (in this case. THE STUDY CONTEXT AND KEY TERMINOLOGY A profile-holder at a SN site can acquire new “friends” by browsing and searching the site and sending requests to be added as a friend. and Gert. We continue with a description of the proposed method. Next. Allison (Figure 2). She comes back to the site looking for new content produced by them. from the perspective of some of her friends. We then discuss managerial implications and illustrate the potential financial benefits of applying our approach versus some naïve methods. Bret. Let us focus on a specific individual in a hypothetical network. using simulated data. From Allison’s perspective. Allison also might be important.1 Usually. We conclude with a summary. It is possible that some of Allison’s friends are regularly checking for content produced by her or are motivated to contribute new content in the hope that Allison would view it. “friends” (the people with whom Allison has exchanged invitations) are Kate. Allison’s ego-centered network is a network of her “friends”. Ana. In turn. a community member) can be inferred from For a wide overview of SNA applications in marketing. SNA is a set of theories and methods that enable the analysis of social structures. Stan. Rene. In Figure 2a. The resulting “friendship” network can be represented by a connected. Our goal in this study is to develop a method to estimate the extent and the direction of the influence associated with each edge in the graph. She also updates her profile and posts new content motivated by the expectation that her important friends will view these updates. we assess the ability of our model to recover true levels of user influence. these are “important” friends. limitations and directions for future research.

This would likely imply that an individual who has more linked profiles is more important than an individual with fewer links. 2001). Iacobucci and Hopkins 1992. or in place of. a better input for SNA techniques would be a network where the link from actor A to actor B has a weight proportional to user A’s influence on user B’s site usage. Iacobucci 1998). attitudes.. for many SN site members. in addition to. Iacobucci 1990. scholarly research on social and communication networks.2 Analyzing the network of “friendship” links. Therefore. or beliefs to the behavior. As these links are easily observable by the firm. network connections and non-connections alone. the network is based on “friendship” links established through exchange of electronic invitations. those based on. but is based on information about other people (Robins et al. people often accept invitations from others just to avoid seeming impolite. however. Our approach is intended to infer such a network. opinion leadership. In an online community. might not be the best alternative when a firm wants to know who is important in terms of influencing site usage. LITERATURE Social influence occurs when an individual adapts his or her behavior. we need to use a network where the relationships represent influence on site usage. Social influence has been the subject of more than 70 marketing studies since the 1960s. SNA tools can then be used to evaluate an individual’s importance in this influence-based network. Overall. attitudes. having their profiles linked to a large number of other users is a matter of prestige or competition rather than a sign of importance or popularity. source credibility. Numerous anecdotal examples from press and personal interviews with SN site participants suggest that.8 his or her location in the network (e.g.3 To study a user’s influence. A link between two profiles on a social networking site does not necessarily imply influence. a network of “friends” might consist of a network of total strangers with whom almost no interaction is taking place. Influence does not necessarily require face-to-face interaction. information is passed among individuals in the form of digital content. Also. and diffusion of innovations has long demonstrated that consumers influence other consumers (Phelps et al. or beliefs of others in the social system (Leenders 1997). Social network analysis can also use other structural measures to flag possibly “important” network players. it might be tempting to apply SNA to infer an individual’s importance. Thus. 3 2 . say.On most SN sites. 2005).

the focus of our study. Dellarocas (2005) analyzed how the strategic manipulation of Internet opinion forums affects the payoffs to consumers and firms in markets of vertically differentiated experience goods. are too numerous to treat them the same way. Kozinets (2002) developed a new approach to collecting and interpreting data obtained from consumers’ discussions in online forums. from a methodological perspective.com. online communities have attracted the attention of many scholars.g. the dissimilarities (e. .9 Here. the motives for and nature of interactions. we consider a particular type of social influence that takes place in an online community: when members change their site usage in response to changes in the behavior of other members. The authors modeled the formation of relationships of “trust” that consumers develop with other consumers whose online product reviews they consistently find to be valuable. Our research seeks to contribute at the intersection of social influence and online communities. Second. Finally.. 2004. Godes and Mayzlin 2004). the revenue-generating models). Epinions. Though a relatively new area in marketing research. The authors tested the proposed model using a survey-based study across several virtual communities. Finally. Stephen and Toubia (2009) looked at a large online social commerce marketplace and studied economic value implications of link formation among sellers. They have either focused on interaction within a dyad (e..g. Dholakia et al.. Some aspects of socializing in the virtual worlds of MySpace or Facebook are similar to the online bulletin board type of interactions found on movie or consumer product review sites (the most common type of online communities studied in the marketing literature). Narayan and Yang (2006) studied a popular online provider of comparison-shopping services. the number of people involved. Godes and Mayzlin (2004) and Chevalier and Mayzlin (2006) looked at the effect of online word-of-mouth communications. previous research has not examined peer influence on individual-level site usage. However. we believe that online social networking sites are a unique type of online community. First. Narayan and Yang 2006) or considered aggregated group-level measures (e. (2004) studied two key group-level determinants of virtual community participation: group norms and social identity. none of the abovementioned empirical studies has simultaneously modeled individual-level influence within a group of users. Dholakia et al.g.

an absence of invitation-based connections between two profiles often imposes serious limitations on a level of interaction (e. we consider a user’s activity to be independent of the activity of second-level friends conditional on the activity of first-level friends. however. a closely analogous assumption is typically made in spatial statistics models. we used a pre-filtering condition that significantly reduces the number of potential connections to be evaluated. A person is said to be part of a user’s “first-level network” of friends if the person has a friendship connection with the user in the sense described above. (Note that this assumption would be flawed if the network structure was misspecified. Rather. For example. but unfortunately are not available to us in the dataset. 4 . Formally. it means that a second-level friend has an effect only via a first-level friend. most members never interact and are not even aware of each other’s existence. In a typical large online community. as when Allison Some other alternatives. The person is taken to be in the user’s “second-level network” of friends if he is not part of the user’s first-level network but is in the first-level network of someone else who is in the user’s first-level network. we would need to evaluate N*(N -1)/2 possible pairs of users on two dimensions: direction and strength. For a real-world network with millions of members. For an online community with N members. This does not imply that a secondlevel friend has no effect on a user. this says that first-level friends’ activities represent sufficient statistics for other user’s activities.10 MODELING AND ESTIMATION Our objective is to estimate the influence that each SN site member has on the site usage of other members. ability to consume and/or exchange digital content). we estimate the direction and strength of an influence only for profiles linked by a “friendship” connection.g.4 Accordingly. This is a natural assumption to make. because the bulk of the interaction on a typical SN site occurs among profiles that have been connected through invitations. Allison may have an effect on Emily but Alberto’s activities are sufficient for characterizing this effect. This limitation of the dataset does not present a serious problem in our research.. given the nature of the network. this task can be both infeasible and unnecessary. might also be considered as candidates for the pre-filtering condition. Higher level networks are defined in similar fashion. We argue that a good candidate for a pre-filtering condition is the existence of an explicit connection between profiles established through an invitation mechanism. In our modeling approach. in Figure 2a. such as cross-profile visitations and message exchanges. To take advantage of this sparseness. Moreover.

and Gert is Allison’s influence on them. in Figure 2a. and a weak impact on Alberto. Ana. Model Specification To identify influential friends within an ego-centered network. a byproduct of ego-centered analysis for Kate. we model the number of daily logins yut as a Poisson regression where (1) yut ~ Poisson(λut ) . John. a moderate impact on Kate and John. Bret.) The assumption of conditional independence from second. we search for influential friends within his or her ego-centered network. Kate. In Figure 2b. and Gert – on her site usage. We suggest that the count of individual daily logins follows a Poisson distribution with rate parameter λut. “Who is influenced by the same given user?” In network terms. Alberto. Rene. We group the predictors into two sets: self-effects and friend-effects. which may vary across users (u) and time (t). Alberto. Stan. “Who is influencing a given user?” For example. we model a user’s login activity as a function of the user’s characteristics. The outcome is that Bret. and John do influence Allison to different degrees. Ana. Bret. Rene. each user is treated as an “ego” once and as a potential “influencer” the number of times equal to the number of “friends” he or she has. while the others have no impact on her. This answers the question. Cameron and Trivedi 1999). John. This answers a second question. the user’s past behavior on the site.11 and Emily know each other directly outside of the online social network and there are interactions other than those through Alberto. The Poisson regression model is derived from the Poisson distribution by specifying the relation between the rate parameter λut and predictors (e. Repeating this process for every individual in the network. the procedure reconstructs a full influence-based network by performing a series of ego-centered estimations.and higher-level friends allows us to adopt an ego-centered approach to the analysis. Self-effects include such covariates as user-specific . Stan. Allison is shown to have a strong impact on Gert.. Accordingly. and the login activity of the user’s friends. we treat Allison as an ego and evaluate the impact of her friends – Kate.g. Taking one user of an online community at a time.

zuft = weighted average of lagged login activities of friend f of user u at time t.. on the site usage of this individual’s friends. This is given in equation 4: (4) zuft = ∑ wu (d ) × y f (t − d ) . intercept. logins at t-1). giving the logarithm of the rate parameter as (2) log(λut ) = Self Effectsut + Friend Effectsut . following the spirit of Guadagni and Little (1983). Fu = number of friends of user u. We define wu(d) (equation 5) as a function of lag d and a smoothing parameter ρ u : (5) wu (d ) = exp(− d × ρu ) ∑ exp(−k × ρ k =1 D u ) . + α uK xuKt + β u1 zu1t + β u 2 zu 2t + . More specifically. then the corresponding coefficient β uk will be significantly different from zero.. If friend k is among the user’s “important” friends (i.. where d =1 D ∑ w (d ) = 1 . among other things.. Thus. day of the week. and past logins. and β uf = coefficient of friend f for user u.e. d =1 u D We also adopt an exponential smoothing expression for wu(d). for each friend f of user u we construct a covariate zuft as a weighted average (wu(d)) of friend f’s login activities over the past D days. It is customary to use exponential rate parameterization. + β uFu zuFu t α uk = coefficient of user-specific covariate k. Friend-effects consist of friends’ lagged login activity. Equation 3 specifies that a user’s site usage at any given point in time depends. log(λut ) = α u1 xu1t + α u 2 xu 2t + . his or her activity level has an impact on the site attractiveness for the user)..12 intercepts. day-of-the-week effect. we have (3) where xukt = user-specific covariate k (e. We note that learning about the activity levels among one’s friends is likely to take time.g. The proposed model captures this process through the friend-specific coefficients βuf..

is slow. We benchmarked the performance of the Poisson model against the more flexible NBD specification (sometimes preferred to the Poisson because of the equi-dispersion assumption) and did not find any significant difference in the estimation results. the estimated directions of influence between users are primarily unidirectional.g.13 We believe that the past actions of friend f are likely to have a diminishing-in-time impact on user u’s activity level at time t. One limitation of our model is that explosion is a theoretical possibility. So. a situation where one user amplifies another user’s future activity and this other user in turn amplifies the first user’s activity. we chose the Poisson regression because it integrates well with our variable selection algorithm (described below) and results in a parsimonious and scalable solution – critical in real-world applications.6 We believe that the characteristics of our data and estimation results make the risks of this relatively small. We allow for full heterogeneity in the smoothing parameter across users. i. for instance. Figure 3 plots a few examples of wu(d) for different values of ρ u . 5 While we can also estimate D as a part of the model. We preset D to be 7 days. . we found it impractical because it slows down the estimation procedure without providing any significant benefits. The flexibility of this approach allows the weight distribution over the past D days to vary from being equal for all lags when ρu → 0 to the entire mass concentrating on the first lag for larger ρ u (e.e.. we expect wu(d) to be a decreasing function in d. when ρ u equals 5 the first lag receives 99. 5 Among several alternative specifications for count data models. Second. when user u influences user u'. we observe that user activity levels are roughly equivalent at the beginning and end of the 12-week observation period. Consider. This suggests that explosion.3% of the weight). This also greatly diminishes the likelihood of explosion. if any. This situation can create a positive feedback loop that causes each user’s activity level to increase indefinitely. results are not sensitive to changes in D for values greater than 5. however. First. user u' does not influence user u.. 6 We thank an anonymous reviewer for making this important point. We also acknowledge that D could be made user-specific.

g.g. the variables correspond to friends and different users have different sets of friends. price and advertising). Thus. across users. we propose to shrink not across users but across friends within a user. In the scanner panel situation. This makes it prudent to choose the acrossfriends distribution of the βuf terms to have a point mass at zero. for a given user. whose effects we are trying to determine. we encounter the “large p small n” situation where the number of parameters to be estimated is large relative to the number of observations. (We actually see this effect in the simulation results described below. This means that the variable set differs. often completely. A problem with this approach is that the activity levels of a user’s friends are often correlated. A key consideration is that the average user probably tracks just a few other “friends. This means that these parameters cannot be reliably estimated in a fixed effects framework. The modeling assumption is that the response coefficients for the various households are drawn from a distribution so that the coefficient values for other households tell us something about the coefficient values for the given household. In our setting. In the social network situation. we could consider a . price and advertising). For example.” Thus. To address this challenge. most of the βuf coefficients in equation 3 are likely to be zero.. however. we need to choose the across-friends distribution of the βuf terms in equation 3. Our panel has about 80 observations for each user. the typical user has about 90 friends and many have hundreds of friends. The bias from omitting the effects of the user’s other friends is likely to produce inflated estimates for the influence of each friend. the variable set (e.. is constant across households. This has been widely done in the analysis of scanner panel data when estimating a given household’s response coefficients for marketing variables (e. To do this. One way to address this is to estimate the model specified in equation 3 but only for each user-friend pair at a time (controlling for other user-specific covariates xuk).) An alternate approach is to use a random effects framework where we pool strength across parameter estimates from multiple samples via Bayesian shrinkage.14 Estimation Challenges In our data. we cannot apply the usual type of Bayesian shrinkage.

we will focus the following discussion on the two-class case. the analyst needs to select the number of mass points.. This approach enables us to parsimoniously capture two phenomena: (1) either a friend is influential or not. we could consider a latent class model. is 1 if friend f is influential and 0 otherwise. We then use this generating distribution to pool strength across friends to estimate the values of the influence coefficients for each specific friend. the Bayes Factor. or cross-validation. We need to estimate the location of this other mass point and its weight. denoted by βu. βu .∀f | • ∝ Bin(γ uf | cuf cuf + duf ) . We now describe how γuf is drawn because it is an unconventional part of our Gibbs sampler: (7) γ uf . 2002). To implement a latent class model. All the model parameters can be drawn from the conditional posterior densities given the other parameters’ values and these are detailed in the Appendix. i. so that βuf becomes either zero or not and (2) the susceptibility to friends’ influence. (In the Appendix we describe the algorithm for the general case with more than two point masses.e. Thus. where the mixture density consists of multiple point-mass densities with one point located at zero. this selection can be based on a number of well known criteria including the Deviance Information Criterion (Spiegelhalter et al.15 mixture density where one component has a point mass at zero and another component is a Gaussian distribution with a mean and variance to be estimated. Alternatively. can vary from user to user. In our empirical application. We elect to pursue the latent class approach here primarily because the estimation algorithm is of high computational efficiency. γuf . Fortunately.) We can decompose each friend-specific coefficient from equation 3 into a user’s susceptibility to friends’ influence. This also is important for scalability in the applied use of our approach. and a binary parameter γuf : (6) β uf = β u × γ uf The binary parameter. we assume that the across-friends distribution of influence coefficients β uf has a mass point at zero and another somewhere else.. models with two or three point masses perform about the same (as detailed later).

The probability puf could be a function of γfu ..… zuFt] and vector γu= [γu1. From the perspective of variable selection. d uf = (1 − pu ) × Lu (i. a friend-specific γuf is drawn as a Bernoulli random variable. We recognize that treating influence in a dyad as independent may result in biased estimates. with γuf =0). with a term puf that is specific to each friend f.16 where cuf = pu × Lu (i. . f =1 Fu 7 The model defined in equation 7 does not control for a possible reciprocity of influence between user u and friend f. The IFu can be interpreted as the number of influential friends. The behavioral interpretation is that friend f has a non-zero influence on user u. We plan to address this issue in future research. sufficient statistics for βuf become an inner product of vector zut = [zu1t. with γuf =1) to model likelihood without friend f (i.… γuF]. the posterior mean of γuf is a probability that the corresponding covariate zuf should be included in the model. zu2t.IFu ). Lu = the Poisson likelihood function for user u. conditional on γu.7 To apply our approach at a social networking site. we might need to estimate millions of egocentered networks. γu2. Accordingly. we could in equation (7) replace pu. The pu term is updated by drawing from the beta distribution Beta(1+IFu . + α uK xuKt + β u ∑ γ uf zuft . with the success probability based on the ratio of the likelihood with friend f ’s effect included (i. γ uf = 1) . In each iteration of the Gibbs sampler. the friend-independent prior probability of friend being influential. which has a mean that is approximately equal to the empirical fraction of influential friends. and pu = prior probability of friend f being influential for user u (estimated in the sampler). As a possible solution. Indeed. 1+Fu . equation 3 can be rewritten as follows: (8) log(λut ) = α u1 xu1t + α u 2 xu 2t + .e. hence making the stochastic realizations γuf a function of γfu.. γ uf = 0) . Updating β coefficients collapses to a simple Poisson regression with a small number of parameters. The proposed decomposition in equation 6 results in a very scalable solution. Let IFu denote the sum of all the γuf terms for any particular user u... accommodating reciprocity.e.

networking goals. their 29. The average number of logins per day in our sample is 2.298. user #118 logs in an average of 4.g. we use the Metropolis-Hastings algorithm within Gibbs sampling steps. To complete the model specification.478 friends. ILLUSTRATING THE METHODOLOGY WITH FIELD DATA We apply our model to data obtained from one of the major social networking sites.48. We refer to these groups as level 1.. The examples illustrate how site usage varies considerably from one user to another: e. we introduce priors over the parameters common to all users (see Appendix). where the proposal density is a normal approximation to the posterior density from the Poisson likelihood. we observe full profile information (e. level 2. after convergence. we have information on login activity but no profile information. This means that the proposal density is adaptive in that it varies from iteration to iteration as a function of the γu values.4 times per day. We burn 50.000 draws and simulate an additional 50. The proposal density for α uk and β u is a normal distribution centered at the maximum likelihood estimate for equation 8.. we track daily login activities for a random sample of 330 users. education. and level 3 networks. Model Estimation We estimate our model using the above described Bayesian approach.g. with the variance equal to the Inverse Fisher information matrix.17 To draw α uk and β u . Instead of the usual random walk. zip code).g.779 friends’ friends. The likelihood considered is conditional on the realizations of γu on that iteration. age. which wishes to remain anonymous. we allow long chains to run. number of profile views) as well as self-reported demographics (e. Figure 4 gives examples of login time series for four randomly selected users in our sample. income. In the 12-week dataset. We monitor chains for convergence and. and their 2.. For level 3 users. For level 1 and level 2 network members. .000. we use an independence chain sampler. number of friends.3 times per day while user #12 logs in an average of 1. implemented with MCMC methods. Each bar on the graphs corresponds to the number of logins on a specific date for a specific individual.

Model 2 includes the effect of all of the user’s friends – i. On an individual level. Model 3 is our proposed model of equation 3. Bars on these graphs correspond to the posterior probability that a particular friend influences the user.e. all of the γuf are set equal to 1. we show two users with a similar number of friends. Second. we focus on the 330 level 1 network users. Model 1 incorporates self-effects only. and to show how profile information can be used to explain these variations. As shown in Table 1. members of the level 1 network are (a) “friends” of level 2 network users and (b) members of the level 3 network. In other words. the proposed model provides a significantly better fit to the data compared to the two benchmarks.. In Figure 5. Note that. Level 1 Network Estimation Results In addition to equation 3. we perform ego-centered estimations for all users in the level 1 and 2 networks. The objectives are to highlight the importance of peer effects in predicting individual user behavior. For the first user. For model fit and comparisons. Due to the limited information on the level 3 network. we perform ego-centered analyses for level 2 network users. we observe a few influential friends. we estimate two benchmark models. by construction. but quite distinct patterns for influential friends.18 Estimation Results Using the model in equation 3. to demonstrate variations in the probability of influence across friends. we present our findings in the following sequence. Figure 6 shows the empirical distribution of the posterior mean γuf for the entire sample. we use the Deviance Information Criterion (DIC). . we observe the anticipated heterogeneity among users in terms of the number of influential friends. which indicates that most of the posterior means γuf are relatively small. while for the second there are none. The distribution is considerably skewed to the left. friends corresponding to these small γuf have a very low probability of influencing site usage. As a by-product of this analysis. we obtain estimates for the influence of level 1 users on level 2 users. about 100 each. First.

19 A key construct of interest is ∑γ f =1 Fu uf . Users who have been members of the site longer have more influential member friends. A user is more influenced by a member friend of the The model defined in equations 1 through 4 can be modified to incorporate profile data in a hierarchical fashion by adding priors on γuf. and relative age. then model 2 should be better than model 3. the total number of influential friends for user u. The probability of being influential in a dyad (γuf) might be explained by static measures available from user profiles. Behaviorally. The expression for this expected value is: (9) Influ = γ u f =1 ∫ ∑γ Fu uf × f (γ )d γ u . same ethnicity for user and friend. while a sample average Influ is about 10 people. To investigate this. we also compute Su = Influ /Fu .which is a point-estimate for the fraction of friends that influence user u. We leave this extension for future research. A good point estimate for this construct is given by computing its expected value over f (γ ) . We find that female friends tend to influence the site usage of male users more so than other gender combinations. This integral is estimated by Monte Carlo using the draws of γuf over the MCMC iterations. 8 . we conduct an exploratory posterior analysis.8 We regress the logit-transformed posterior values for γuf on several covariates extracted from the user profiles: gender combination for the dyad (in this case we chose male user. If a user has no friends influencing him or her. To aid interpretability. A sample average Su is approximately 22%. this suggests that 32% of users do not have any “influential” friends whose login activity on the site would help to explain variations in their site usage. the posterior joint density of all the gamma terms across friends. about one out of five friends within an ego-centered network significantly impacts the login decisions of ego. The DIC criterion (calculated individually for each ego-centered network) shows no decrement in fit for 32% of individuals in the level 1 network as we go from model 3 to model 2. On average. months the friend has been a member. user’s dating objective. female friend). Table 2 gives the results from the regression.

Green (2008) raises the notion of computing a user’s “Google number” – the sum total of an individual’s influence on others. We note.20 same ethnicity. in part. In a similar spirit. that the profile data analyzed here explains relatively little of the total variation in influence – the R2 for the regression is 0. The empirical distribution has a bimodal shape. however. by the login pattern of a corresponding user.67). Therefore.25 (weights slowly decrease with lag). On the other hand. This suggests that there may be serious limitations to using profile information alone for advertising and retention targeting. friends who are older than the user have less influence. We find that for users with high (higher means) and stable (lower variances) login activities more weight is assigned to the recent activity9. Variation in ρ u may be explained. Level 2 Network Estimation Results In discussing Google’s patent application for ranking influence. We also looked at the distribution of the smoothing parameter ρ u across users (Figure 7). Behaviorally.18 (t=8. These exploratory results suggest that posterior analysis based on profile information could help the firm predict the probability of influence for any dyad in the network (since all members provide profile data at signup). we first estimate 9 The regression coefficient for the mean of daily logins is 0. . we regressed ρ u on the mean and the variance of daily logins (calculated on a holdout sample).11 and it does not improve with the addition of other profile variables. as some practitioners have indicated (Green 2008). less active or irregular users have weights more evenly distributed across lags. with one mass point at approximately 2. we need to evaluate his or her impact on egos within each ego-centered network of which he or she is a member. To investigate this. this suggests that it may take less time for “regular” users to learn about changes in friends’ behavior and to form new expectations regarding content updates.05) and the coefficient for the variance of daily logins is 0. to estimate an individual’s influence in an online community. Finally.05 (t=5. Users with dating objectives have fewer friends who are influential.2 (corresponds to 90% of weight assigned to lag 1) and the second mass point at approximately 0.

we calculate the network influence (Iu) as a sum of the marginal impacts he or she has on egos (e) in the level 2 network (equation 11):10 (11) Iu = ∑ Fu e =1 γ ue βue ∫∫y e × βue × γ ue × f (γ ue . the average response effect to a one-unit change in regressor can be calculated as n n follows: 1 ∂E [ yi | xi ] = β × 1 exp( x ' β ) . . Our findings indicate that the majority of users have very little impact on the behavior of others.. For example. The predictor variables correspond to the lagged activity levels of a user’s friends and the 10 For the Poisson regression. As expected. β ue ) = the posterior distribution of γ ue and β ue . ILLUSTRATING THE METHODOLOGY WITH SIMULATED DATA We also used simulated data to examine the performance of our proposed Bayesian shrinkage approach.e. Fu = the number of ego-centered networks user u enters as a “friend” (i. For the Poisson regression model with intercept included. the number of friends of user u). do show a significant influence. Specifically.21 ego-centered networks of all level 2 users. f (γ ue . this can be ˆ ˆ ∑ ∂x ∑ j i n i =1 n i =1 i ˆ shown to simplify to y × β (Cameron and Trivedi 1999). Then. Some. some users with similar numbers of friends have total network impacts which differ by a factor of eight. βue )d γ ue d β ue . The variable set for the simulation consists of two parts: the predictor variables and the response variable. however. The simulated datasets are designed to be in the stochastic neighborhood of the data used in our field application. the extent of this influence also varies considerably across users. for each user (u) of network 1. These results reflect the ability of our method to identify influential users in the network and to quantify the likely extent of this influence. we are interested in assessing how well our procedure recovers the identity of influential users under varying conditions. where ye = the average number of daily logins of user e.

dropping the first 8000 iterations as burn-in. We also vary T. the number of time-periods observed. we randomly select Fu of that user’s friends and take the corresponding activity data to be the predictor variables. we applied our estimation procedure to the corresponding simulated data. The predictor variables in our simulation take values corresponding to the empirically observed lagged activity levels of a user’s set of friends. For each of the 20 users in each of the 81 cells. is 85 in the empirical dataset. For each cell. and the β u . The response variable’s values are then simulated from the paper’s Poisson model based on the predictor variables and on assumed values for the parameters of the model. and Fu the number of friends (both influential and non-influential) that user u has. we randomly choose 20 users with a number of friends greater than or equal to the value of Fu in that cell. How well the model differentiates between influential ( γ uf = 1 ) and non-influential ( γ uf = 0 ) users. then half the observations are an exact replication of the 85 time-periods in the empirical data and the other half are from a random sampling with replacement. If the T value in a cell is 40. which represent whether or not friend f is influential on the focal user u. Simulation Results We computed and now discuss the following three performance measures: 1. How well the model recovers the share of influential friends (pu). We ran our samplers for 10000 iterations. the number of time-periods observed. 2. If the T value is 170 in a cell. The term pu represents the fraction of user u’s friends who are influential on the user. The model parameters are the γ uf .22 response variable corresponds to the activity level of a specific focal user. which represent the influence strength of a friend who is influential. The simulation datasets are drawn according to a 3×3×3×3 design (81 simulation settings in total) produced by manipulating the four factors specified in Table 3. We vary these model parameters over multiple replications. . The middle level of each factor has been chosen to match the value obtained for the corresponding parameter from our field application. For each user. The value of T. then we randomly sub-select 40 of the 85 time-periods.

90. pu. We considered this threshold only to understand how much worse the simple C0. The other is to set C to minimize the total misclassification error in predicting influentials on a certain holdout sample for that cell. For the fraction of influential friends. 92. we label this the C0. .5 threshold would be relative to the best possible choice. the optimal threshold yields only slightly better performance than the simple threshold of ½ and we therefore drop Copt from further discussion.1 and 82% for pu= 0.05. Specifically. The first is to set C to ½. We consider two ways of setting C.5 threshold and 92% for the Copt threshold. As the proportion of influential friends rises. The prediction is based on the posterior mean γ uf . Thus.5 threshold and between 77% and 100% for the Copt threshold. We assess how well the model predicts whether or not a ˆ certain friend is influential. How well the model recovers the strength of an influential friend’s influence β u . it becomes somewhat harder to distinguish them from non-influential ones. were not statistically significant (at least for the cells considered in our design).9% for Fu=180. as the number of friends increases. it becomes harder to distinguish influential from non-influential friends. Our specific experiment design was chosen to match the field data’s structure and it is important to keep in mind that these patterns may not extend to different designs. the mean correct classification rate was 96. we found strong effects. we would not be able to set the value of C in the latter way because we do not observe the true value of γ uf .23 3. The main effects for the first two factors. Across all 81 cells. We also conducted an ANOVA analysis to explore the differences in the correct classification rate across the 81 cells. The average fraction is 90% for the C0. the mean for the correct classification rate was 95.4% for Fu=45. For the main effect Fu . Thus.3% for pu= 0.2.9% for pu= 0. we label this the Copt threshold. the fraction of correctly classified friends varies between 68% and 100% for the C0.5 threshold.8% for Fu=90 and 84. each significant well beyond the 0. number of friends. For the other two factors (Fu and pu). we predict ˆ that friend f is influential if γ uf exceeds a threshold C.01 level. In actual practice. β u and T. Measure 1: Identifying the influentials.

Across all users in all 81 cells. T. non-Bayesian alternative. We now report how well our methodology recovers the fraction of influential friends.. we estimate the Poisson regression for each user-friend pair controlling for other user-specific covariates. this is 21% on average. The β u parameter represents the influence strength of influential friends. Using the simulated data constructed above. We report the mean value of β u for each of the three factor levels averaged over all of the 27 cells with a particular (true) value for β u . The mean absolute value of the difference (MAD) between the actual and the estimated fraction is 0.88. Estimating the influence strength of influential friends. Comparison with Alternative Method We used our simulation to compare the performance of our proposed Bayesian shrinkage approach to a simpler. the correlation between the true fraction and the estimated fraction is 0. this fraction is estimated as Su = Influ /Fu.24 Estimating the share of influential friends. An ANOVA showed that the main effects for the four factors were not strongly significant. The error is strongly affected by the number of observations. Specifically. Dividing the absolute error by the true value gives the relative error. where Influ = ∫ ∑γ γ u Fu f =1 uf × f (γ )d γ u and the integration is over the posterior density of the γ uf terms. a figure we consider to be reasonably low given the difficulty of the problem. The relative error is about 29% for T=40 days. In this simpler approach. we use t-statistics to identify the subset of friends whose logins significantly explain the focal user’s site activity. and 14% for T=170 days. Table 4 summarizes the recovery results for the β u parameter at each ˆ level in the simulation design.05. This suggests that our accuracy in estimating a user’s β u goes up moderately quickly as additional days of activity are recorded and available for analysis. 20% for T=85 days. Table 4 also reports the absolute error over the 27 cells. we estimate . we label this the univariate friend model. As we discussed above (see equation 9).

96 equal to zero. the simulation studies provide encouraging evidence for the performance of the proposed method for estimating influence levels for users in an online social network. using these β uf and assuming full fixed effects model (multivariate in number of friends Poisson regression) we generate the dependent variable for 85 observations in our actual sample. We also compared the performance of the proposed model versus a non-Bayesian alternative assuming that data simulation follows the latter. We simulated another 20 observations from the same empirical distribution to generate a hold out sample.96 as influential. We use the following simulation procedure to generate data from the alternative. the simulation results support the use of the more elaborate Bayesian modeling procedure for the purposes of identifying influential users. We classify friends with t β ≥ 1. Taken together. with an average of 65%. For all users in our level 1 network we estimate a univariate (in number of friends) Poisson regression model (controlling for self-effects).25 3×3×3×3×20×Fu Poisson regressions by maximum likelihood and calculate t-statistics ( t βuf ) for each β uf . In our dataset only 60 percent of users satisfy this condition. we estimate two models (univariate non-Bayesian whose parameters are indicated by superscript NB and the proposed Bayesian model indicated by superscript B) 11 We thank the Area Editor for this suggestion. uf Across the 81 simulation designs. The procedure assures that the simulated data are in the stochastic neighborhood of our field application but generated according to a non-Bayesian alternative model. 11 Under such a scenario the “true” model is a fixed effects model where each friend is associated with an individual β uf coefficient. This is substantially worse than the performance of the proposed Bayesian approach for which the average correct recovery rate is 90% (using the ½ threshold). Next. We set all β uf with t βuf ≥ 1. the univariate friend model correctly classified the influential friends between 46% and 92% of the time. . Note that. non-Bayesian model. a full fixed effects model can be estimated only for network members who have more login observations than friends. Finally. Thus. in practice.

MAD calculated on the out of sample predictions of y is about seven percent lower for the Bayesian approach (3.78 vs. which can offset the “bias” disadvantage that comes from the fact that the generating process is different from what is implicitly assumed by the Bayesian model. MANAGERIAL IMPLICATIONS FROM THE FIELD DATA ANALYSIS In this section. the correlation between an individual’s total marginal impact (our estimate of influence as revealed NB B NB B . In small samples. this surprised us because theory predicts that the data generating model should perform no worse than alternative models. Initially. For our sample of 330. To address this. The correlation between β uf and β uf is 0.08). First.066. The mean absolute difference between β uf and β uf is 0. we discuss implications for managers of applying the model in practice. while for β uf and β uf it is 0. we evaluate scenarios to quantify the financial value of retaining the site’s most influential users. One practical consideration is whether the Bayesian approach is likely to offer meaningful gains in the identification of the most influential users when compared to simple metrics. we ask how well the univariate friend model and some simple descriptors of user activity. the Bayesian can do better because it is a shrinkage estimator and therefore has a variance advantage. we evaluate how well other plausible measures of user importance do in capturing influence.014. The results show that insample recovery of β uf is better with the proposed Bayesian model. Second. Specifically.26 using the first 85 observations and compare in and out of sample performance. we look at the number of friends and the number of profile views (“hits”) for each user. such as friend count and profile views. capture the influence levels of the site’s most important users.59 while for β uf and β uf it is 0. the simulation shows that the proposed Bayesian model is robust under a data generation process which assumes heterogeneity in friend effects within a user. In sum.70. Finally. Both analyses show that the proposed Bayesian approach offers significant potential benefits to managers concerned with targeting users for advertising and retention. 4. But then we realized that the theoretical prediction is only an asymptotic result.

96) e =1 Fu . For example. but continue to reach below the top 10%.96 as influential.72. We compared how well the univariate friend model and the simple metrics of friend count and profile views predict influence at the top of the list. if we were to extend 12 Because both rankings use the same measure of individual marginal impact to calculate the cumulative top-k impact. This stems from the likelihood that managerial interest is focused on members with the highest levels of influence (e. considerable variance remains unexplained.13 Based on the foregoing analysis. it is a given that. We place the values for the top 5% (corresponding to the cumulative impact at user 16) in boldface for discussion purposes. As we noted above. in Table 5. which is however still substantially lower than the proposed model (61.12 The ratio is worse for profile views.53). the non-Bayesian alternative produces the best result out of three naïve rankings. In Table 5.53.49 versus 82. 13 The procedure closely follows the one for the Bayesian method except the selection criterion for influential users is based on the t-statistics calculated for each β uf from pair-wise Poisson regressions. At this point.. practitioners have been disappointed with the ability of simple metrics to capture influence on social network sites (Green 2008). the top 5 or 10 percent of users).37/82. The relative payoffs from targeting based on estimated network influence are greatest at the top of the site’s member list. Ranking is based on the total marginal impact which we calculate as follows: I u = ∑ ye × β ue × I (tβue ≥ 1. The correlation between total marginal impact and profile views is 0. Because of their social influence and impact on the site activity of others. ranking by friend count yields a substantially lower network impact (43.15 versus 82. We classify friends with t βuf ≥ 1. Therefore. While the number of friends and profile views both predict influence. we present data for the top 10 percent (33 users) of our sample and show the cumulative impact obtained by proceeding further down the list. the first list is by construction greater than the naïve ranking for each value of k. at 24. the numbers in the “Proposed Model” column will be greater than the numbers in the “Number of Friends” column. we believe that simpler metrics such as friend count and profile views are likely to be inadequate proxies for user influence.34.g.27 by the model and equation 11) and his or her total number of friends is 0. Finally.53). these top users are likely to be the most valuable to advertisers and most important for the site to retain as members (Green 2008). We also note that focusing on correlations alone could be potentially misleading in a practical sense.

While CPM on some premium sites can be as high as $15.com did historically with its “Cool New People” feature – picking users at random and showing their profiles on the site’s home page. In network settings. The negative impact of an influential user leaving the site is not limited to the lost revenues from. if there is cost associated with retention efforts. Rather. 2006). monetary rewards. Finally.). We suggest that this can be determined through a series of small-scale field experiments with different target groups.g. Therefore. We thank the Area Editor for this suggestion. This gives a popularity boost to the selected profiles. One simple approach would be to choose individuals at random. the average number of pages viewed on a community site by a unique visitor per month is about 130. the firm may consider targeting influential individuals identified by some empirical model – for this we compare the proposed model and a non-Bayesian alternative. Alternatively the site may focus on users who have many friends or users with a high number of profile views. In practice. most social networking sites have CPM under one dollar. Also. the firm needs to know whom to target. Price quotes from several social networking sites indicate that 40 cents per thousand impressions is a reasonable benchmark. but also of the effect that this customer has on other customers (Gupta et al.g. such as access to special features. the site usage of all linked (dependent) users will be affected as well. say. we look at the potential effects on advertising revenue from impressions. 5% of users) would involve addressing millions of user members at the major SN sites. even narrowly focused targets (e. 14 . an incentive for site usage stimulation.. This is actually similar to what MySpace. of course. etc. we would find that the top 33% of influencers are responsible for 66% of the total impact. ad impressions not served to this particular individual. Also. when determining how much a firm should be willing to spend to retain a particular customer.28 Table 5 to our entire sample of 330 users.. We now turn to implications of our model for managing retention efforts. (e. a more comprehensive targeting approach should take into account an individual’s propensity to leave the site with and without the program. According to Nielsen//NetRatings (2005). the user’s network influence should be part of the valuation. From what we None of these approaches tells us how responsive the selected individuals are to the firm’s retention efforts.14 To gauge returns from targeted retention. customer value to the firm is not solely a function of the cash flows generated by a customer.

17 which translates to about 10. corresponds to a drop of 209. If the drop in network activities persists.08 cents per day or $36. users visit the site an average of 2.5 million.48 logins per day instead of using individual level daily login averages. In the case of random selection. Targeting based on the number of friends and the number of profile views results in total payoffs of $57. go from 2. All of these are substantially below Since our analysis does not reveal any significant correlation between a number of daily logins and strength of network influence.15 In our sample of 330 users the loss of the top 5% of users.369 is the average predicted impact across individuals in our sample.48 times 1. Adding these up gives the payoff from retention actions in this scenario to be about $93. Finally. the average user contributes approximately 13 cents per month or $1. In addition. The main purpose of this example is to illustrate that different targeting approaches can lead to substantially different financial implications.81 a year (10. we use a sample average of 2. We can use the data in Table 5 to develop a numerical estimate of what would happen to the network if top users drop activity levels to zero (i.33). ranked by influence.16 million.175 cents. so each login generates about .369 times 17).54 cents per day. for 10 million users. Scaling up to the size of the actual network and adding users’ “stand-alone” value gives $36. This translates to about 36. 15 . Thus. 17 people correspond to approximately 5% of users in the sample. 17 1. given that the firm targets the top 5% of influential users as identified by our model.72 daily logins (2.15 logins in the network (2.65 million in payoff. the annual financial impact would be $78. which we do not discuss here.29 have observed across multiple social networking sites. From our data.50 a year of revenue from this source. the cumulative impact of losing 5% of users is a drop of 57. respectively. each user alone (ignoring network effects) contributes about $1.54 cents times 365 days).50 a year of impression-based ad revenue. To accurately infer the network impact caused by losing a customer the model needs to take into consideration several other factors.08 cents times 365 days). is $15 million.78 million and $38. targeting based on ranking results produced by the univariate friend model will bring a total payoff of $72.5 million.48 to 0 logins).16 We repeat this analysis for the four other targeting approaches. the total loss to the firm is roughly $133 a year (36.e.48 times per day. 16 We acknowledge that our approach is an oversimplification. Scaling this number to match the size of the target group of an actual network (approximately 10 million users in the case of MySpace).48 times 84.7 million. which.. the average page carries about 2 to 3 ads.

Distinguishing weak links from "strong" links is a difficult problem for two reasons.30 the $93.. has not yet done this in an application to massive right-hand-side expressions. Our primary intended contribution is a methodology for extracting. gender. We also found that descriptors from user profiles (e. we develop and test a non-standard Bayesian shrinkage approach. stated dating objectives. application of the Bayesian approach might significantly improve the ability of social network sites to target their most influential members and to design and implement cost-effective retention programs. drawing from the literature on variable subset selection. This sets up a challenging P>N problem. the strong links out of a large overt network which has mostly weak links. existing research. when compared to simpler alternatives. In sum. . To the best of our knowledge. say in less than three months. To address this. etc. CONCLUSION Firms operating social networking sites see an "overt" network of friends. the number of overt links is very large. with limited data.) lack the power to enable us to determine who. we implement a highly scalable Poisson regression model which shrinks influence estimates across friends within users.g. First.5 million associated with targeting from the proposed model. The spirit of this finding is corroborated by Google’s recent efforts to better quantify social influence so that it might extract more revenue from targeted advertising on MySpace. Specifically. Second. where the shrinkage is done across predictors within a model. Most of the links in this network are "weak" in the sense that the relationships do not significantly impact behavior in the network. defined according to who added whom as a friend. per se. we found that relatively few so-called “friends” are actually significant influencers of a given user’s behavior (22 percent is the sample mean) while substantial heterogeneity across users also exists. We tested the model on field data provided by an anonymous social networking site. so the number of "observations" available is fairly small. As expected. is influential – the R-squared is 11 percent. the firm wants to distinguish the links fairly quickly.

sites offer ways to minimize the need for a user to explicitly navigate to a friend's page if he/she wishes to see the friend's activities. Apart from privacy and storage cost concerns. regression-based alternative model. we showed that friend counts and profile views also fall short of being able to identify influential site members. primarily because that is the data our partner social networking firm operated on. for example. we believe that our modeling approach – Bayesian shrinkage across the friend effects within a user – will readily extend. Our approach could be readily applied to other data that might also be available to firms operating networking websites. the P>N situation is exacerbated and the primary benefits of our methodology are enhanced. Nonetheless. Our application to the field data provides a vivid illustration of the set of results which firms could obtain from applying the model in practice. Specifically. Our Bayesian shrinkage approach also does so much better than a simpler. Examining a user retention scenario. Major sites compile the main updates of a user's friends' profiles and present this compilation directly on the user's own-profile-page. We believe that these also have important implications for SN sites as businesses. we showed that the model performs well in correctly recovering the influential users and other key features of the data. In addition to the poor performance of profile descriptors in predicting influence.g. We could. we also illustrate the potential for large gaps in financial returns to the firm from using the model-based estimates of influence versus friend count. with details of the interactions). especially for the most important 5 to 10 percent of users. we chose to illustrate our methodology with login data. random selection. To make site usage easy. With richer information of this kind. profile views. as MySpace has done. In this paper. or. should managers or investigators wish to extend the model to include additional usage information. augment our data set with richer information on the overt links (e. social networking firms focus on own-profile-page visitation and login data because the own-profile-page is the center-point of the typical user's activity and interactivity in the site.31 We also assessed the model’s performance using simulated data. .

It is difficult – if not impossible – for a firm to comprehensively know when and how users interact with one another through other digital media.32 We also note that our model does not incorporate the potential dynamics in a user’s influence over time. The contribution of this research is to address the methodological hurdles involved in suitably identifying such users so that the process of targeting and intervention may advance. Other researchers. From interviews with a number of users. In future research. for example. SMS and BlackBerry e-mails. Another potential limitation to our research is shared by most Internet-related studies. we believe that the value of identifying influential users can be established through relatively straightforward (and small scale) field experiments. this might be done by using.. Going forward. have noted the inherent limitations of models built upon behavioral data collected on a single site when users’ activities go beyond it. . heterogeneity in β u along the time dimension can be added. and Kimbrough (2001) also warn of possibly erroneous conclusions from models that are limited to such data. we have learned that many maintain profiles on multiple SN sites. Of course.g. Padmanabhan. as well as digital communications such as Instant Messaging (IM).. A final limitation is that we are unable to address how responsive the top “influencers” selected by our approach are likely to be to marketing actions (e. we focus on modeling online activity tracked by a single site and must leave the potential role of other Internet and offline activities for future research. Zheng. In this paper. a Hidden Markov approach. This means that the managerial implications of our modeling approach are illustrative at this stage. targeted advertising or a firm’s retention efforts). such as Park and Fader (2004). To address this. the offline interactions of users are also potentially relevant and not generally available to researchers or managers. like many others in Internet marketing.

ed. “The Value of a "Free" Customer. 24.A. Robert V.” Psychological Bulletin. “Interactive Marketing and the Meganet: Network of Networks. Mela.” Journal of Marketing Research. 23 (4). 29. David. 5-16. Iacobucci.” The Wall Street Journal. 61-72.msi. Dellarocas. and Dina Mayzlin (2006).33 References Cameron. College Park. Heather (2008).” Journal of Marketing Research.” In Companion in Econometric Theory. Bagozzi and Lisa Klein Pearo (2004). Iacobucci. “Derivation of Subgroups from Dyadic Interactions. “Strategic Manipulation of Internet Opinion Forums: Implications for Consumers and Firms”. Iacobucci. New York: Basil Blackwell. Kozinets.” Social Networks.” as appears at http://www. (1996). 241-263.com/2008/03/07/top-social-networks-traffic-feb-2008/]. “Using Online Conversations to Study Word-of-Mouth Communication.” Harvard Business School. Elizabeth (2006). Dawn. 2006. “The Effect of Word of Mouth on Sales: Online Book Reviews. working paper (#07-035). 107. Colin and Pravin K.” Journal of Marketing Research. (2005).and Small-group-based Virtual Communities”. Chrysanthos N. Roger Th. Baltagi (Ed. Carl F. Holmes. 43 (3). December.” (accessed July 29.Make Way for the New Guys.compete.cfm.com (2008). 21. 545-560.” Journal of Interactive Marketing. Gupta. Compete. MD. 50. Networks in Marketing.. “A Social Influence Model of Consumer Participation in Network. University of Maryland. and Nigel Hopkins (1992). Utpal M. Iacobucci. Chevalier. 114-132.” Marketing Science. “2006-2008 Research Priorities. Dawn (1990). Leenders. 39(1).J. and Jose M. “Essentials of Count Data Regression.” Business Week. Vidal-Sanz (2006). August 31. Sunil. Green. July 30. . Dawn (1998). “The Field Behind the Screen: Using Netnography for Marketing Research in Online Communities. “Modeling Social Influence through Network Autocorrelation: Constructing the Weight Matrix. and Dina Mayzlin (2004). A. Marketing Science Institute (2006). “No Day at the Beach: Bloggers Struggle With What to Do About Vacation. 2008). Trivedi (1999). Dholakia. “Modeling Dyadic Interactions and Networks in Marketing. Judith A. Richard P. working paper. “February Top Social Networks . (2002). Page B1. Dawn.org/msi/rp0608. October 6. B. [http://blog. Winter. 21-48. revision March 2005.). Sage Publications. 5-17. International Journal of Research in Marketing. “Google: Harnessing the Power of Cliques. Godes. (2002).

. and Stefan Decker (2004).. Van den Bulte.wikipedia. 2008). N.” Journal of Marketing Research. forthcoming.mediapost. “Trust between Consumers in Online Communities: Modeling the Formation of Dyadic Relationships. Philippa Pattison and Peter Elliott (2001). 66.” Journal of the Royal Statistical Society. Breslin. 154-164.” Journal of Advertising Research. Zhiqiang Zheng. “Network Models for Social Influence Processes. [www. Walsh. 30. 2008]. A.P. 583-639.” Online Media Daily. Phelps. MA: Marketing Science Institute. “Online Social and Business Networking Communities.san&s=83859].” (accessed August 28. Cambridge. Young-Hoon and Peter S.” Psychometrika. Scott. 161-190.34 Narayan. Spiegelhalter. Vishal and Sha Yang (2006). Christophe and Stefan Wuyts (2007). John G. 44(4).org/wiki/List_of_social_networking_websites.G. “Deriving Value from Social Commerce Networks. O’Murchu. 2001.” Marketing Science. June 3. 333-348. Mark (2008). David Perry. 2002. D. Social Networks and Marketing. [http://en. and van der Linde. Stephen. and Steven O. “Modeling Browsing Behavior at Multiple Websites. “List of Social Networking Websites. August 28. (accessed Sept... Balaji. NYU. Garry. Lewis. Social Network Analysis. “Goldstein at IAB: Marketers Can’t Ignore Social Media. Joseph E. 64.” Working paper. Series B. Park.” In Proceedings of KDD 2001. and Olivier Toubia (2009). “Viral Marketing or Electronic W-O-M Advertising: Examining Consumer Responses to Pass Along Email. John (1992).J. Newbury Park CA: Sage. Best. Lynne Mobilio. Andrew T. Regina..com/publications/index. Summer. 2008). 280-303. and Niranjan Raman (2004). Carlin. Ina. 23 (3). Wikipedia (2008). “Personalization from Incomplete Data: What You Don't Know Can Hurt. “Bayesian Measures of Model Complexity and Fit.. Robins. Kimbrough (2001). Padmanabhan.” DERI Technical Report 2004-08-11.cfm?fuseaction=Articles. B. (2002). Fader (2004).

25 -0.010.93 72. Model Fits Model Model 1 Model 2 Model 3 Description User self effects only (none of friends is influential) Self plus all friends’ effects (all friends are influential) Self and friends effects with variable selection (some friends are influential) 18 Fit (DIC) 76.34 -15.72 18.40.08 4.the Probability of Being Influential in a Dyad Covariate Gender combination (female friend/male user) Months user has been a member Friend is of the same ethnicity as user User is looking for a date Friend is older than user R2 * LHS: log γ uf / (1 − γ uf ) ˆ ˆ Coefficient.66 76. .26 0.82 -2.66 -0.613. which is only slightly better than a DIC of 72863. Explaining Variation in the Posterior Mean Values for γuf . The appendix describes the algorithm for 2 point masses as well as for the general case of K-point masses. we focus our discussion on the 2-class case. Hence.35 Table 1.02 6.* t-statistic 0.04 11% ( ) 18 The DIC for a three point masses model is 72294.863. for expositional ease.36 0.45 Table 2.45 in a two point masses case.

β u | across the 27 corresponding cells .12 40 45 0. Recovery of β u .16 170 160 0.13 0.035 βu 0.14 0.029 Level 3 0. Simulation Design.14 0.12 0.1 Level 3 0.11 0.2 Table 4.14 85 90 0.05 Level 2 0. Level 1 Actual value of Level 2 0.024 ˆ Mean of β u across the 27 corresponding cells ˆ Mean Absolute Error | β u .36 Table 3. Level 1 Coefficient of friend’s f influence on user u β u Number of observations T Number of friends Fu Fraction of Influential Friends pu 0.16 0.

97 75.4 times (82.96 110.5% 4.54 To be read as: cumulative network impact of top 16 individuals identified by the proposed model is 1.67 10.6% 0.12 74.82 32.68 18.88 6.92 4.89 43.09 41.22 20.48 20.6% 3.9 times (82.41 8.0% 7.09 15.51 101.87 47.35 11.24 43.3% 0.77 65. .53/43.51 94.2% 4.09 78.93 53.57 50. the number of friends and the number of profile views.94 99.57 13.71 82.2% 1.7% 7.80 94.44 26.4% 6.41 2.14 50.87 40.12 24.1% 2.3% 3.5% 8.42 61.98 29.46 60.75 88.37) greater than the total impact of top 16 users identified by the non-Bayesian alternative model.01 46.25 19.05 70.4% 2.8% 6.54 50.49) and 3.53 90.66 68.39 23.94 18.15).58 58.27 43.01 39.16 54.60 35.41 6.8% 9.38 73.23 132.61 76.49 48.1% 6.60 4.18 45.88 78.72 24.47 92.70 95.85 45.64 116. Comparison of Cumulative Influence Captured Cumulative Network Impact of Top k Individuals when Ranked by Top k Centile Proposed Model Univariate Friend Model Number of Friends Number of Profile Views * 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 0.92 99.67 74.44 17.15 62.23 64.35 times (82.7% 10.3% 7.04 107.47 11.65 21.44 10.56 26.9% 1.46 41.81 113.95 78.47 49.9% 4.53 30.16 67.99 63.44 39.22 72.37* 25.4% 9.9% 8.50 66.56 25.6% 7.63 9.89 24.14 89.32 4.84 15.50 105.53/24.51 130.27 14.2% 8.5% 1.70 81.0% 3.35 96.90 43.05 40.29 39.8% 2.47 92.04 105.0% 10.68 49.40 17.24 11.80 32.85 37.86 56.07 90.53 85.53 86.5% 5.91 16.05 124.28 122.73 59.19 71.85 52.17 22.8% 5.37 Table 5.26 66.03 102.77 53.98 44.47 35.2% 5.58 50.50 45.33 85. respectively.7% 3.47 119.53/61.32 37.1% 9.77 82.95 14. 1.93 88.78 127.42 25.70 56.

Profile Example .38 Figure 1.

39 Figure 2 (a) Inferring an Individual’s Influence Figure 2 (b) Inferring an Individual’s Influence .

Examples of Weights w(d) Distribution across D Lags for Different ρ u .40 Figure 3. Figure 4. Login Time Series Examples for Four Users .

Distribution of Posterior Mean γuf Figure 7. Estimation Results for Two Users Figure 6.41 Figure 5. Distribution of Smoothing Parameter ρ u .

c2 ) with c1 = 0. β u ~ Normal (0. aK ) with a1 = a2 = . with probability 1.. We generate parameter values α u( n ) and βu( n ) using a normal distribution centered at the maximum likelihood estimate (MLE) for equation (8) with the variance equal to the asymptotic variance (approximated by inverse of the Hessian H of the log-likelihood).100 I ) .. where the Poisson likelihood function is approximated by a normal distribution with Metropolis correction. truncated at 0 on the left. ρu ~ Uniform(c1 . α uk ~ Normal (0.. pu 2 point masses case: pu ~ Beta(a. 5. We accept the new values of α u( n ) and βu( n ) with the probability: . γ uf 2 point masses case: γ uf = 1 with probability pu. where k is a class index. The likelihood considered is conditional on the realizations of γu on that iteration.42 Appendix The Prior Distributions 1.01 and c2 = 5 ( ρ u at value c1 results in weight being evenly distributed over all D lags.. Generate α u . X u .100 I ) 2... We reject candidates for β u ≤ 0 (note that in K point masses case β u is a vector of K-1 elements. K point masses case: pu ~ Dirichlet (a1 . 3. and γ uf = 0 . a2 . βu | γ u . b) with a = b = 1. Z uf We use an independence Metropolis-Hastings sampler. while c2 is chosen to have over 99% of weight assigned to the first lag). The Gibbs Sampler For each user (ego) u: 1. K point masses case: γ uf = k with probability puk.pu. = aK = 1 . since for k = 1 β u1 is set to be 0). 4.

βu . β u . fα and f β are priors on α u and βu respectively. βu . γ uf ) × pu + Lu (α u ... γ uf ) × (1 − pu ) ⎪ ⎪ ⎩ ⎭ (1) (0) where Lu (α u ..43 ' ⎧ ( ( ⎡ ⎛ ⎡ ( o ) ⎤ ⎡ ( MLE ) ⎤ ⎞ ⎛ ⎡α uo ) ⎤ ⎡α u MLE ) ⎤ ⎞ ⎤ ⎫ ( ( ⎪ Lu (α u n ) . Otherwise we keep parameter values from the previous iteration α u 2. where IFu = ∑γ ∀f uf and Fu is a number of “friends” of user u. β . Z uf for all “friends” f of user (ego) u. γ uf ) is an individual likelihood evaluated when γ uf = 0 . β u .e.1⎪ Pr( accept ) = min ⎨ ⎬ ' ⎡ 1 ⎛ ⎡α ( n ) ⎤ ⎡α ( MLE ) ⎤ ⎞ ⎛ ⎡α ( n ) ⎤ ⎡α ( MLE ) ⎤ ⎞ ⎤ ⎪ ⎪ (o) (o) (o) (o) n u u u u ⎪ Lu (α u . β u | γ u ) × fα (α u ) × f β ( β u ) × exp ⎢ − ⎜ ⎢ ( n ) ⎥ − ⎢ ( MLE ) ⎥ ⎟ H ⎜ ⎢ ( n ) ⎥ − ⎢ ( MLE ) ⎥ ⎟ ⎥ ⎪ ⎟ ⎜ ⎟ ⎢ 2 ⎜ ⎣ βu ⎦ ⎣ βu ⎦ ⎠ ⎝ ⎣ βu ⎦ ⎣ βu ⎦ ⎠⎥ ⎭ ⎪ ⎝ ⎣ ⎦ ⎪ ⎩ where Lu is an individual likelihood function. γ uf ) × pu ⎪ ⎪ Pr(γ uf = 1) = ⎨ ⎬. a2 + Fu 2 . β u . Generate γ uf | α u . 3. γ uf ) × puj ⎪ ⎪ j =1 ⎪ ⎩ ⎭ ( where Lu (α u . βu . friend f is assigned to class k). K point masses case: We draw γ uf as a categorical random variable selecting a friend index f in a random order: Pr(γ uf ⎧ ⎫ ⎪ L (α . We draw γ uf as a Bernoulli selecting a friend index f in a random order: (1) ⎧ ⎫ Lu (α u . γ ufk ) ) is an individual likelihood evaluated when γ uf = k (i. γ uf ) is an individual likelihood evaluated when γ uf = 1 and Lu (α u ..IFu ) . b + Fu . . γ ( k ) ) × p ⎪ ⎪ uk ⎪ = k ) = ⎨ K u u u uf ⎬. X u . aK + FuK ) . ( j) ⎪ ∑ Lu (α u . β u . β u( n ) | γ un ) × fα (α u n ) ) × f β ( β u( n ) ) × exp ⎢ − 1 ⎜ ⎢α u ⎥ − ⎢α u ⎥ ⎟ H ⎜ ⎢ ( o ) ⎥ − ⎢ ( MLE ) ⎥ ⎟ ⎥ ⎪ ⎟ ⎪ ⎢ 2 ⎜ ⎣ β u( o ) ⎦ ⎣ β u( MLE ) ⎦ ⎟ ⎜ ⎣ β u ⎦ ⎣ β u ⎦ ⎠⎥ ⎪ ⎝ ⎠ ⎝ ⎪ ⎣ ⎦ .. (1) (0) Lu (α u . Generate pu | γ uf for ∀f 2 point masses case: We draw pu ~ Beta (a + IFu . βu . K point masses case: We draw pu ~ Dirichlet (a1 + Fu1 . 2 point masses case: (o) and βu( o ) .

X u . γ uf . X u . We accept the new value of ρ u n ) with the probability: ( ⎧ Lu (α u . Generate ρ u | α u . 4. Z uf ) × f ρ ( ρu ) ⎪ ⎩ ⎭ where z (n) uft = ∑w d =1 D (n) ud × y f ( t − d ) . . X u . A ( ( ( candidate ρ u n ) is formed as ρ u n ) = ρ u o ) + ζ . w (d ) = (n) u ( exp(− d × ρu n ) ) ∑ exp(−k × ρ k =1 D (n) u . Z ufn ) ) × f ρ ( ρun ) ⎫ ⎪ ⎪ Pr(accept ) = min ⎨ . (o) o ⎪ Lu (α u . It is ( fixed thereafter. β u . β u .1⎬ . σ ρ ) and σ ρ is being adjusted dynamically during burn-in iterations to insure acceptance rate in 25 to 45 percent range. Otherwise we keep ( parameter values from the previous iteration ρ u o ) and do not update Zuf. Z uf We draw a smoothing parameter ρ u using a random walk Metropolis-Hastings algorithm. and f ρ is prior on ρ u ) ( If a candidate value ρ u n ) is accepted. where ζ ~ Normal (0. β u . we use new weights wu to calculate Zuf.44 where Fuk = ∑1(γ ∀f uf = k) .