Professional Documents
Culture Documents
Preferential attachment in the growth of social networks: The internet encyclopedia Wikipedia
Statistical properties of social networks have become a independent individuals in various languages, covering top-
major research topic in statistical physics of scale-free net- ics they consider relevant and about which they believe to be
works 关1,2兴. Social networks can be of different types but competent. Our dataset encompasses the whole history of the
may be classified either belonging to collaboration systems Wikipedia database, reporting any addition or modification
or information systems. Among collaboration systems we to the encyclopedia. Therefore, the rather broad information
can find, e.g., scientific coauthorship networks 关3兴, where contained in the Wikipedia dataset can be used to validate
edges are drawn between scientists and those who co- existing models for the development of scale-free networks.
authored the same paper; or actors collaboration networks In particular, we found here one of the first large-scale con-
关4兴, where edges are drawn between actors playing in the firmations of the preferential attachment, or “rich-get-richer,”
same movie. Information systems are the result of human rule. This result is rather surprising, since preferential attach-
interaction, intended in a broad sense. We mention here the ment is usually associated to network growth mechanisms
network of sexual contacts 关5兴, where the study of its statis-
triggered by local events: in the WWW, for instance, web-
tical properties is strictly related to the spreading of infec-
masters have control on their own web pages and outgoing
tious diseases 关6兴, and the World Wide Web 共WWW兲, which
is often put outside the social networks category due to its hyperlinks, and cannot modify the rest of the network by
peculiarities 关7兴. adding edges elsewhere. Instead, by the “Wiki” technology a
In this paper, we analyze the graph of Wikipedia 关8兴, a single user can edit an unlimited number of topics and add
virtual on-line encyclopedia. This topic attracted very much edges within the Wikipedia network.
interest in recent times 关9,10兴 because of its topological The dataset of Wikipedia at the time of submission of this
structure. Wikipedia grows constantly as new entries are con- analysis is made of pages in about 100 different languages;
tinuously added by users through the Internet. Thanks to the the largest subset is almost 1 000 000 pages of the English
Wiki software 关11,12兴, any user can introduce new entries version. All sets are still growing at an exponential pace 关14兴.
and modify the existing ones. It is natural to represent this The dataset we considered in our analysis 共June 2004兲 are
system as a directed graph, where vertices correspond to en- smaller and correspond to the figures given in Table I with
tries and edges to hyperlinks, the latter autonomously drawn the largest dataset 共English兲 of about 300 000 vertices.
between various entries by independent contributors. A detailed analysis of the algorithms used to crawl such
We find that the Wikipedia graph exhibits a topological data is presented elsewhere 关15兴. Here, we start our analysis
bow-tie-like structure, as does the WWW 关13兴. Moreover,
the frequency distributions of the number of incoming 共in- TABLE I. Size of the graph as collected in the dump of June 13,
degree兲 and outgoing 共out-degree兲 edges show fat-tail power- 2004.
law behaviors. Further, the in degrees of connected vertices
are not correlated. These last two findings suggest that edges Language Vertices Edges
are not drawn toward and from existing topics uniformly.
Rather, the large number of incoming and outgoing edges of Portuguese 8645 51 231
a node increases the probability of acquiring new incoming Italian 13 132 159 965
and outgoing edges, respectively. In the literature concerning Spanish 27 262 288 766
scale-free networks, this phenomenon is called “preferential French 42 987 660 401
attachment” 关4兴 and is explained in detail below. German 116 251 2 163 405
Wikipedia is an intriguing research object from a sociolo- English 339 834 5 278 037
gist’s point of view: pages are published by a number of
036116-2
PREFERENTIAL ATTACHMENT IN THE GROWTH OF… PHYSICAL REVIEW E 74, 036116 共2006兲
FIG. 3. The preferential attachment for the in degree and the out
FIG. 2. In-degree 共white symbols兲 and out-degree 共filled sym- degree in the English and Portuguese Wikipedia network. The solid
bols兲 distributions for the Wikipedia English 共circles兲 and Portu- line represents the linear preferential attachment hypothesis ⌸
guese 共triangles兲 graph. The solid line and the dashed line represent ⬃ kin,out.
simulation results for the in-degree and the out-degree respectively,
for a number of 10 edges added to the network per time step. To draw a more complete picture of the Wikipedia net-
in,out 共bottom line兲 and the kin,out 共top
Dotted-dashed lines show the k−2.1 −2 work, we have also measured the correlations between the in
line兲 behavior, as a guide for the eye. and out degrees of connected pages. The relevance of this
quantity is emphasized by several examples of complex net-
with in-degree k at time t 关17兴. The weighting factor works shown to be fully characterized by their degree distri-
1 / P共k , t兲 is necessary to take into account the multiplicity of bution and degree-degree correlations 关20兴. Among other
nodes with the same degree. In fact, the conditional probabil- quantities, suitable measure of such correlations is repre-
ity that a node i with degree ki acquires a new node at time i, sented by the average degree K共nn兲共k兲 of vertices connected
is proportional to ⌸共ki兲 / P共ki , t兲. to vertices with degree k 共for simplicity, here we refer to a
Since the Wikipedia network is oriented, the preferential nonoriented network to explain the notation兲 关21兴. These
attachment must be verified in both directions. In particular, quantities are particularly interesting when studying social
we have observed how the probability of acquiring a new networks. As for other social networks, the collaborative net-
incoming 共outgoing兲 edge depends on the present in 共out兲 works studied so far are characterized by assorted mixing,
degree of a vertex. The result for the main Wikipedia net- i.e., edges preferably connect vertices with similar degrees
work 共the English one兲 is reported in Fig. 3. 关7兴. This picture would reflect in a K共nn兲共k兲 growing with
With linear preferential attachment, as supposed by the respect to k. If K共nn兲共k兲 共decays兲 grows with k, vertices with
BA model, both plots should be linear over the entire range similar degrees are 共un兲likely to be connected. This appears
of degree; here we recover this behavior only partly. This is to be a clear cutting method to establish whether a complex
not surprising, since several measurements reported in litera- network belongs to the realm of social networks, if other
ture display strong deviations from a linear behavior 关18兴 for considerations turn ambiguous 关22兴.
large values of the degree, even in networks with an inherent In the case of an oriented network, such as Wikipedia, one
preferential attachment. Such behavior is the result of finite has many options while performing such assessment, since
size effects 关17兴. By carefully observing Fig. 3 we note that one could measure the correlations between the in or the out
for certain datasets 共e.g., English兲, the slope of the growth of degrees of neighbor vertices, along incoming or outgoing
⌸ at small connectivities might follow a sublinear law, thus edges. We chose to study the average in-degree K共nn兲in 共kin兲 of
not completely ruling out different mechanisms more com- upstream neighbors, i.e., pointing to vertices with in-degree
plicated than preferential attachment. This is an important kin. By focusing on the in-degree and on the incoming edges,
issue, since a sublinear preferential attachment fails to pro- we expect to extract information about the collective behav-
duce scale-free networks 关19兴. However, since the sublinear ior of Wikipedia contributors and filter out their individual
exponent is very close to unity 共ca. 0.9 for the English Wiki- peculiarities: the latter have a strong impact on the out-
pedia兲, we shall consider it as strictly unity in order to define degree of a vertex and on the choice of its outgoing edges,
a simplified model able to reproduce the stylized facts of the since contributors often focus on a single Wikipedia topic
systems. 关14兴.
Further, it is worth to mention that the preferential attach- Our analysis shows a substantial lack of correlation be-
ment in Wikipedia has a somewhat different nature than in tween the in degrees of a vertex and the average in-degree of
other networks. Here, most of the times, edges are added its upstream neighboring vertices. So, as reported in Fig. 4,
between already existing vertices, unlike the BA model. incoming edges carry no information about the in degrees of
036116-3
CAPOCCI et al. PHYSICAL REVIEW E 74, 036116 共2006兲
036116-4
PREFERENTIAL ATTACHMENT IN THE GROWTH OF… PHYSICAL REVIEW E 74, 036116 共2006兲
−关1/共1−R1兲兴−1
P共kout兲 ⯝ kout , 共1兲 what reasonable and expected, by considering the strong as-
sumptions made in the model definition. A community struc-
which can be easily verified by numerical simulation. By ture could arise in models where thematic divisions are taken
adopting the values empirically found in the English Wiki- into account by means of fitness hidden variables 关30–32兴.
pedia R1 = 0.026, R2 = 0.091, and R3 = 0.883, one recovers the In conclusion, the bow-tie structure already observed in
same power-law degree distributions of the real network, as
the World Wide Web, and the algebraic decay of the in-
shown in Fig. 2.
degree and out-degree distribution are observed in the Wiki-
As regards the English version of Wikipedia a largely
pedia datasets surveyed here. At a deeper level, the structure
dominant fraction R1 of new edges is created between two
existing pages, while smaller fractions R2 and R3 of edges, of the degree-degree correlation also resembles that of a net-
respectively, point or leave a newly added vertex. By analyz- work developed by a simple preferential attachment rule.
ing the update record of the above English Wikipedia data- This has been verified by comparing the Wikipedia dataset to
base we measured that, in the period 2001–2004, such frac- models displaying no correlation between neighbors degrees.
tions took the values R1 = 0.883± 0.001, R2 = 0.026± 0.001, Thus, empirical and theoretical evidences show that tradi-
and R3 = 0.091± 0.001. tional models introduced to explain nontrivial features of
The degree-degree correlations K共nn兲
in 共kin兲 can be computed
complex networks by simple algorithms remain qualitatively
analytically by the same lines of reasoning described in Refs. valid for Wikipedia, whose technological framework would
关22,29兴, and for 1 Ⰶ kin Ⰶ N we have allow a wider variety of evolutionary patterns. This reflects
on the role played by the preferential attachment in generat-
共nn兲 MR1R2 1−R ing complex networks: such mechanism is traditionally be-
Kin 共kin兲 ⬃ N 1 共2兲
R3 lieved to hold when the dissemination of information
throughout a social network is not efficient and a “bounded
for R3 ⫽ 0, the proportionality coefficient depending only on
rationality” hypothesis 关33,34兴 is assumed. In the WWW, for
the initial condition of the network, and
example, the preferential attachment is the result of the dif-
K共nn兲
in 共kin兲 ⯝ MR1R2 ln N 共3兲 ficulty for a webmaster to identify optimal sources of infor-
mation to refer to, favoring the herding behavior which gen-
for R3 = 0, where N is the network size. Both equations are erates the “rich-get-richer” rule. One would expect the
independent from kin, as confirmed by the simulation re- coordination of the collaborative effort to be more effective
ported in Fig. 4 for the same values of R1, R2, and R3. There- in the Wikipedia environment since any authoritative agent
fore, the theoretical degree-degree correlation reproduces can use his expertise to tune the linkage from and toward any
qualitatively the observed behavior; to obtain a more accu- page in order to optimize information mining. Nevertheless,
rate quantitative agreement with data, it is sufficient to tune empirical evidences show that the statistical properties of
the initial conditions appropriately. As shown in Fig. 4, this Wikipedia do not differ substantially from those of the
can be done by neglecting a small fraction of initial vertices WWW. This suggests two possible scenarios: preferential at-
in the network model. In the first case the values of the tachment may be the consequence of the intrinsic organiza-
exponents can vary from −2 共when R1, R2, respectively, are tion of the underlying knowledge; alternatively, the preferen-
0兲 to minus infinite 共when R1, R2, respectively, are −1兲. For tial attachment mechanism emerges because the Wiki
the expected 共empirical兲 range of variation 共i.e., from 2% to technical capabilities are not fully exploited by Wikipedia
20%兲 R2 = 0.02→ ␥in = 2.02, R2 = 0.2→ ␥in = 2.25, R1 = 0.02 contributors: if this is the case, their focus on each specific
→ ␥out = 2.02, R1 = 0.2→ ␥out = 2.25. As regards the slope of subject puts much more effort in building a single Wiki en-
the average neighbor coefficient, this does not depend upon try, with little attention toward the global efficiency of the
the values of the parameter. We confirm these behaviors by organization of information across the whole encyclopedia.
means of numerical simulations.
Nevertheless, this model does not reproduce the observed The authors acknowledge support from the European
complex community structure of Wikipedia. This is some- Project DELIS.
关1兴 R. Albert and A.-L. Barabási, Rev. Mod. Phys. 74, 47 共2002兲. 关7兴 M. E. J. Newman, Phys. Rev. Lett. 89, 208701 共2002兲.
关2兴 S. N. Dorogovtsev and J. F. F. Mendes, Evolution of Networks: 关8兴 http://www.wikipedia.org/
From Biological Nets to the Internet and Www 共Oxford Uni- 关9兴 T. Holloway, M. Božičević, and K. Börner, e-print cs.IR/
versity Press, Oxford, 2003兲. 0512085.
关3兴 M. E. J. Newman, Phys. Rev. E 64, 016131 共2001兲; 64, 关10兴 After our submission on the arXiv, a new paper appeared on
016132 共2001兲. these topics V. Zlatić, M. Božičević, H. Štefančić, and M.
关4兴 A.-L. Barabási and R. Albert, Science 286, 509 共1999兲. Domazet, e-print physics/0602149.
关5兴 F. Liljeros et al., Nature 共London兲 411, 907 共2001兲. 关11兴 http://en.wikipedia.org/wiki/Wiki
关6兴 R. Pastor-Satorras and A. Vespignani, Phys. Rev. Lett. 86, 关12兴 F. B. Viégas, M. Wattenberg, and K. Dave, CHI ’04: Proceed-
3200 共2001兲; Phys. Rev. E 65, 035108共R兲 共2002兲. ings of the SIGCHI conference on Human factors in comput-
036116-5
CAPOCCI et al. PHYSICAL REVIEW E 74, 036116 共2006兲
ing systems, 2004, p. 575. 关24兴 A. Capocci et al., Physica A 352, 669 共2005兲.
关13兴 A. Z. Broder et al., Comput. Netw. 33, 309 共2000兲. 关25兴 M. E. J. Newman, e-print physics/0605087.
关14兴 J. Voss, Proceedings 10th International Conference of the In- 关26兴 G. Caldarelli, C. C. Cartozo, P. De Los Rios, and V. D. P.
ternational Society for Scientometrics and Informetrics 2005. Servedio, Phys. Rev. E 69, 035101共R兲 共2004兲.
关15兴 D. Donato et al., A library of software tools for performing 关27兴 P. L. Krapivsky, G. J. Rodgers, and S. Redner, Phys. Rev. Lett.
measures on large networks, COSIN Tech-report Deliverable 86, 5401 共2001兲.
D13 共http://www.dis.uniroma1.it/~cosin/publications兲, 2004. 关28兴 B. Tadic, Physica A 293, 273 共2001兲.
关16兴 H. A. Simon, Biometrika 42, 425 共1955兲. 关29兴 A. Barrat and R. Pastor-Satorras, Phys. Rev. E 71, 036127
关17兴 M. E. J. Newman, Phys. Rev. E 64, 025102共R兲 共2001兲. 共2005兲.
关18兴 M. Peltomäki and M. Alava, e-print physics/0508027. 关30兴 G. Caldarelli, G. Capocci, P. De Los Rios, and M. A. Munoz,
关19兴 P. L. Krapivsky, S. Redner, and F. Leyvraz, Phys. Rev. Lett. Phys. Rev. Lett. 89, 258702 共2002兲.
85, 4629 共2000兲. 关31兴 D.-H. Kim, B. Kahnga, and D. Kim, Eur. Phys. J. B 38, 305
关20兴 G. Bianconi, G. Caldarelli, and A. Capocci, Phys. Rev. E 71, 共2004兲.
066116 共2005兲. 关32兴 M. Boguñá, R. Pastor-Satorras, A. Diaz-Guilera, and A. Are-
关21兴 A. Vázquez, R. Pastor-Satorras, and A. Vespignani, Phys. Rev. nas, Phys. Rev. E 70, 056122 共2004兲.
E 65, 066130 共2002兲; S. Maslov and K. Sneppen, Science 关33兴 M. Rosvall and K. Sneppen, Phys. Rev. Lett. 91, 178701
296, 910 共2002兲. 共2003兲.
关22兴 A. Capocci and F. Colaiori, e-print cond-mat/0506509. 关34兴 S. Mossa, M. Barthelemy, H. E. Stanley, and L. A. Nunes
关23兴 M. E. J. Newman, Eur. Phys. J. B 38, 321 共2004兲. Amaral, Phys. Rev. Lett. 88, 138701 共2002兲.
036116-6