
The Performance of Online Communities –

An Empirical Investigation of Wikipedia

Diploma Thesis
at the Institute of Entrepreneurship & Innovation
Vienna University of Economics and Business Administration
Degree Program: Business Administration

Submitted by:
Roman Pickl
Degree Program Identification No.: J151
Student Enrolment No.: h0451691

Advisor: Univ. Prof. Dr. Nikolaus Franke

Assistant Advisor: Dr. Philipp Türtscher

Vienna, 17 June 2009


Abstract

Online communities have been thriving in recent years and have not only drawn the attention
of researchers and professionals but also influence our daily lives. Their success factors,
however, are still rather unclear. This paper sheds light on this topic by analyzing the relationships
between the characteristics of online communities and their performance. To this end,
5000 communities around individual articles in Wikipedia were analyzed. The results
demonstrate that community characteristics are significantly linked to the quantity and qual-
ity of the output created by online communities. The number of users is by far the most influ-
ential force that drives content creation. When it comes to the quality of the output, however,
characteristics of community members and how they collaborate are as important as the
sheer number of contributors. These findings can be utilized by community operators who
want to foster the development of their online communities and firms which look for promis-
ing communities to scan for innovative ideas and users.

Acknowledgments

I would like to express my gratitude to my advisor Univ. Prof. Dr. Nikolaus Franke and my
assistant advisor Dr. Philipp Türtscher for their support, encouragement and time to listen to
and discuss little problems and roadblocks. Special thanks also go to Matthias Pickl and
Daniel Winzer who provided thoughtful comments on this thesis. Furthermore, I want to
thank the IT Department of the Vienna University of Economics and Business Administra-
tion, especially Franz Schäfer, for helping with the means to handle the huge amount of data
and thus making this research project possible. Finally, I thank my family and my girlfriend
Julia for their continuous support and encouragement.

Roman Pickl | The Performance of Online Communities |I|


Table of contents

1 Introduction
1.1 Objective
1.2 Structure
2 Literature Review and Hypotheses Development
2.1 Online Communities
2.2 Performance of Online Communities
2.2.1 Information-Quantity
2.2.2 Article-Quality
2.3 Links between Community Characteristics and Performance
2.3.1 Community-Centered Perspective
2.3.2 User-Centered Perspective
2.3.3 Collaboration-Centered Perspective
3 Research Method
3.1 Research Site
3.2 Study Design
3.3 Data Collection and Cleansing
3.4 Measures
3.4.1 Operationalization of Performance Indicators
3.4.2 Operationalization of Community Characteristics
4 Results
4.1 Descriptive Statistics
4.2 Inferential Statistics
4.2.1 Results related to Information-Quantity
4.2.2 Results related to Article-Quality
5 Discussion and Implications
5.1 Implications for Theory
5.2 Implications for Methods
5.3 Implications for Practice
5.4 Limitations
5.5 Directions for Future Research
6 References
7 Appendix
7.1 Figures and Examples
7.2 Data Collection
7.2.1 Setting up the Database and Drawing the Sample
7.2.2 Parsing the Data and Calculating all Variables
7.3 References used in the Appendix:

Figures and Tables

Figure 1: Characteristics of online communities (original illustration)
Figure 2: Excerpt of the revision history of a Wikipedia article (Wikipedia 2009h)
Figure 3: Research model (original illustration)
Figure 4: Sampling process (original illustration)

Table 1: Results of the conducted OLS-regressions
Table 2: Summary of results for hypotheses H1A-H6A (Information-Quantity)
Table 3: Summary of results for hypotheses H1B-H6B (Article-Quality)
Table 4: Significant effects of community characteristics on performance

1 Introduction

With the increasing ubiquity of the Internet, recent years have seen a surge in the number of
online communities1 (Kozinets 1999, p. 253; Rashid et al. 2006, p. 955). These communities
of interest are known for attracting innovative users with a high level of domain-specific
knowledge and are hence an important source of innovation (Füller, Jawecki & Mühlbacher
2007, p. 60). Even though open source software projects like Linux and Apache are among
the first examples that come to mind when thinking about successful online communities, this
phenomenon is not limited to the software sector. In fact, communities have shown astonish-
ing performances in numerous diverse industries (Lakhani & Panetta 2007, p. 98).

Wikipedia, the online encyclopedia that allows everyone to edit or create new articles, is a
case in point. Initiated in 2001, it has grown to more than 11 million articles in more than 260
languages written by more than 14 million users (Wikimedia 2008d). As of today, it is one of
the most visited pages on the Internet (Alexa.com 2008) and, even though it is often scrutinized
due to its open source principle, is generally known for its notably high quality (Giles 2005, p.
900).

The success of Wikipedia and various other online communities has not only drawn the attention
of researchers and professionals but has also influenced how society functions in terms of,
among other things, production, learning, communication and commerce (Cothrel & Williams
1999, p. 54; Tapscott & Williams 2008, p. 20; Wanga & Fesenmaier 2004, p. 709). Consequently,
many firms across industries have tried to embrace online communities, in particular
to harness their creative potential for developing new products and services (Füller, Matzler &
Hoppe 2008, p. 609; Nambisan 2002, p. 393). Many of those communities, however, fail (Co-
threl & Williams 1999, p. 54) and success factors still remain rather unclear (Leimeister, Sid-
iras & Krcmar 2006, p. 281).

1 The terms “online community”, “virtual community”, “computer-mediated community”, “cyber community”,
“net community” or “e-community” are often used interchangeably (Döring 2001; Wanga & Fesenmaier
2004, p. 709).

1.1 Objective

Even though online communities have been studied from a variety of perspectives and several
authors have derived recommendations for operating online communities, characteristics of
successful communities have hardly been substantiated empirically (Leimeister, Sidiras &
Krcmar 2006, p. 281). This thesis aims to close this research gap by introducing a framework
to combine several of these perspectives and empirically analyze the link between characteris-
tics of online communities and their performance.

Businesses that want to involve online communities in their innovation process generally have
two distinct options: they can either try to find and utilize already existing communities or
attempt to build their own (Franke 2005, p. 708). Analyzing the link between community
characteristics and performance yields valuable insights both for operators who want to proactively
encourage the development of successful online communities and for firms that look
for promising communities to scan for innovative ideas and users. Due to the rise of a new
paradigm, often called “Web 2.0” (O'Reilly 2005), which causes growing interest in user gen-
erated content and user participation today (Tapscott & Williams 2008, p. 38), it is even more
important to shed light on this topic.

Given that Wikipedia “can be viewed as a massive experiment in collective action” (Viégas et
al. 2007, p. 2), observing communities in this online environment allows the examination of
numerous communities with different characteristics. In this study, a random sample of 5000
articles, created in 2007, is retrieved from Wikipedia and the characteristics of the community
of users collaborating on each article are analyzed. These characteristics are then linked to the
quantity and quality of the output created by these diverse online communities to answer the
following research question:

How is the performance of an online community related to its characteristics?

1.2 Structure

This thesis is structured as follows: chapter 2 provides a summary of the results of an in-
depth literature review and deals with the current state of research. Furthermore, the research
question is formulated in more detail and several hypotheses are developed. While chapter 3
provides a detailed explanation of the methodology used in this thesis, results are presented in
chapter 4. Chapter 5 concludes with a discussion of the results, implications for theory,
methods and practice as well as an outlook for promising future research directions. Addition-
ally, several examples along with an in-depth explanation of the data collection and analysis
method used in this study can be found in the appendix.

2 Literature Review and Hypotheses Development

This chapter defines key terms and provides an insight into the theoretical background of this
study. After introducing the concept of online communities and factors indicating community
performance, the relationships between the characteristics of online communities and their
performance are explored from various perspectives.

2.1 Online Communities

The idea of geographically separated people meeting online to talk about common areas of
interest and to build online communities is older than the Internet itself (Licklider & Taylor
1968, p. 38). In fact, one of the first services of the Arpanet, file transfer, was soon diverted
from its intended use and employed for sending messages (Barabási 2003, p. 149). As a result,
huge mailing lists emerged and, due to the rise of the Arpanet, other networks and eventually
the Internet, more and more people were able to gather online and exchange their
thoughts on various topics (Cothrel & Williams 1999, p. 54; Koch 2002, p. 327). As a result
of his positive experiences of support and friendship on the bulletin board system “The
WELL”, Rheingold coined the term “Virtual Community” to describe this social phenomenon
(Döring 2001; Rheingold 1993). Consequently, a scholarly discussion started about whether the
term “community” should be used in this context at all as virtual communities lack face-to-
face contact and various other characteristics of traditional communities (Döring 2001; Jones
1997; Preece & Maloney-Krichmar 2005).

Even though scientists nowadays generally agree that virtual communities are “real” commu-
nities (Döring 2001) and the strict distinction between online and offline activities is losing
importance (Preece, Maloney-Krichmar & Abras 2003, p. 8), online communities are still a
vague concept with no widely accepted definition (Leimeister, Sidiras & Krcmar 2006, p.
278; Preece 2001, p. 347). It is, however, not the intent of this paper to comprehensively re-
view and analyze each single argument. Rather, online communities are understood as a cate-
gory with fuzzy boundaries (Bruckman 2006, p. 618) and are discussed in a broad context.

Online communities have existed in various forms for approximately three decades (Ridings,
Gefen & Arinze 2002, p. 272). In recent years several researchers stressed the high innovation
potential of online and offline user communities and examined the specific ways of how their
users cooperate and support each other (Franke & Shah 2003, p. 159; von Hippel 2001, pp.
84-85). They argue that users, not producers, are responsible for a considerable number of
major innovations in various industries (Franke & Shah 2003, p. 157), on some occasions even
without the need for a manufacturer (von Hippel 2001, p. 86). As a result, more and more
businesses are nowadays trying to harness the enormous creative potential of online commu-
nities (Füller, Matzler & Hoppe 2008, p. 609). However, many of them fail (Cothrel & Wil-
liams 1999, p. 54) and success factors are still quite unclear (Leimeister, Sidiras & Krcmar
2006, p. 281).

This paper aims to shed light on this topic by examining the relationships between community
characteristics and performance. To this end, communities of users collaborating on individual
articles in Wikipedia are investigated. One may argue that observing the community around
an article amounts to taking it out of context, similar to merely tracking changes in a single
file of an open source project, without considering dependencies. Mateos Garcia & Stein-
mueller (2003), however, call attention to an important difference between open source soft-
ware systems like Linux and open content collections like Wikipedia. While contributions to
open source software (e.g. a new software module) entail a high need for integration due to
their “cumulative dependency”, articles in a collection are far less dependent on each other (p.
17) and valuable as standalone items (Wales 2005a, 1:30). In line with these findings, Voss
highlights that “most likely one can determine subcommunities” in Wikipedia (2005, p. 8).

Consequently, it seems plausible to consider the communities collaborating on every article of
Wikipedia as independent online communities. This not only makes it possible to utilize Wikipedia
as a massive experiment but also to examine the produced output and performance of numerous
communities with different characteristics.

2.2 Performance of Online Communities

Because there are numerous kinds of online communities, it is of utmost importance
to define clear and measurable objectives to assess their performance (Cothrel 2000, pp.
17-18). What is more, online communities can be examined from different perspectives and
consequently various performance indicators are discussed in the literature (Leimeister, Sid-
iras & Krcmar 2006, p. 279; Preece 2001, p. 354). While many of them, however, are very
general measures, Stvilia et al. suggest a specific set of metrics to assess Wikipedia articles
which seems especially relevant in the context of this paper (2005, p. 3). Applicable and rele-
vant factors of these studies were adapted to the research setting and enriched with a number
of additional metrics. Thus, this paper applies a direct approach when measuring the perform-
ance of different communities in Wikipedia by analyzing their output, whereas previous re-
search often focused on participation as a proxy of value creation (Cothrel & Williams 1999,
p. 55; Preece 2001, pp. 350-351).

Cothrel & Williams point out that a successful online community is “one that achieves its
purpose” (1999, p. 55). Asked about the purpose of Wikipedia, co-founder Jimmy Wales once
remarked:

“Wikipedia is first and foremost an effort to create and distribute a free encyclopedia
of the highest possible quality to every single person on the planet in their own lan-
guage. […] the entire purpose of the community is precisely this goal” (2005b).

Even though the success of online communities is often a question of perspective (Leimeister,
Sidiras & Krcmar 2006, p. 279; Preece 2001, p. 354), due to their intrinsic motivation non-
commercial operators tend to agree on success factors with community members (Leimeister,
Sidiras & Krcmar 2006, p. 292). Thus in the case of the non-profit project Wikipedia, the pur-
pose for members and operators is crystal clear: create the largest high-quality reference work
for free. Consequently, the performance of communities in Wikipedia needs to be evaluated in
terms of the quantity of information and the quality of the articles produced.

2.2.1 Information-Quantity

When thinking about quantifying the information included in Wikipedia articles, the length of
each article is the first measure that comes to mind. However, if counting the number of
words were the only measure applied, articles that repeat the same information over and over
again would score far too high. Therefore, to increase its validity, the analysis was comple-
mented with an examination of the vocabulary used in each article, a proxy of the information
content contained. Furthermore, the number of links was assessed to take the amount of ex-
ternal information referred to into account. The following paragraphs provide additional in-
formation on these measures.

Number of words: The length of articles in Wikipedia and hence the quantity of information
produced by different communities varies sharply (Stvilia et al. 2005, p. 7). To assess the
length of every article the total number of words was counted. This measure was preferred to
using the mere number of characters to take different topics and hence varying word-lengths
of their vocabularies into account.

Vocabulary: This measure depicts the number of unique words and hence the word pool used
in the community effort. A short test revealed that this metric, which is easy to understand and
easy to calculate, correlates very well (0.985) with the zipped file size of the articles, which is
known to be a good proxy of entropy (Voss 2005, p. 9).
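
The first two measures, together with the zipped-size proxy mentioned above, can be illustrated with a minimal Python sketch. It is not the procedure used in this thesis (the actual parsing is documented in the appendix); the function names, the simple `\w+` tokenizer and the use of zlib as a stand-in for "zipped file size" are assumptions made for illustration only.

```python
import re
import zlib


def tokenize(text):
    """Split text into lowercase word tokens (assumed, simplistic tokenizer)."""
    return re.findall(r"\w+", text.lower())


def quantity_measures(text):
    """Return word count, vocabulary size and compressed size of a text."""
    words = tokenize(text)
    return {
        # Total number of words, repetitions included
        "number_of_words": len(words),
        # Number of unique words, i.e. the word pool
        "vocabulary": len(set(words)),
        # zlib-compressed size in bytes, a rough proxy of information content
        "compressed_bytes": len(zlib.compress(text.encode("utf-8"))),
    }


sample = "The cat sat on the mat. The cat sat on the mat."
measures = quantity_measures(sample)
print(measures["number_of_words"], measures["vocabulary"])
```

In the sample, the repeated sentence doubles the word count but leaves the vocabulary unchanged, which is the effect the vocabulary measure is meant to capture.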

Number of links: Buriol et al. found that the average number of outgoing links per Wikipedia
article increased over the last few years (2006, p. 5). This measure not only indicates the
amount of external information incorporated, but also the number of additional references
provided and is hence considered as an important dimension of information quantity.
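
As a rough illustration of how links could be counted in the raw markup of an article, the following Python sketch uses two regular expressions. It assumes the common wikitext conventions `[[Target]]` and `[[Target|label]]` for internal links and plain `http(s)` URLs for external ones; the function name, the patterns and the decision to skip namespace-prefixed targets are hypothetical and do not reproduce the operationalization described in chapter 3.4.

```python
import re

# Assumed wikitext conventions: [[Target]], [[Target#Section]], [[Target|label]]
INTERNAL_LINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")
# Assumed form of external links: any http or https URL
EXTERNAL_LINK = re.compile(r"https?://[^\s\]<>|]+")


def count_links(wikitext):
    """Count internal article links and external URLs in a wikitext string."""
    targets = [t.strip() for t in INTERNAL_LINK.findall(wikitext)]
    # Skip namespace-prefixed targets such as Category: or File:
    internal = [t for t in targets if ":" not in t]
    external = EXTERNAL_LINK.findall(wikitext)
    return {"internal": len(internal), "external": len(external)}


example = (
    "[[Vienna]] is the capital of [[Austria|the Republic of Austria]]. "
    "See [http://example.org site]. [[Category:Cities]]"
)
link_counts = count_links(example)
print(link_counts)
```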

2.2.2 Article-Quality

The results of an expert-led investigation carried out by Nature indicate that Wikipedia articles
have a similar quality to those in the Encyclopaedia Britannica (Giles 2005, p. 900). This
study has attracted much attention and has been criticized for the selection and comparison of
articles (Encyclopædia Britannica 2006). Due to the fact that only a small number of articles
(42) was reviewed, it additionally lacks validity (den Besten, Loubser & Dalle 2008, p. 1). To
overcome these weaknesses several researchers have tried to automatically assess the quality
of Wikipedia articles (p. 8). One approach is to calculate readability scores as a measure of
quality (p. 8), a metric also applied in this study. To take the level of integration of each arti-
cle into account this analysis was complemented with an examination of the number of cate-
gories each article is placed into. Further details are provided in the following paragraphs.

Readability: Readability metrics have been used by several researchers to assess the quality
of Wikipedia articles (den Besten, Loubser & Dalle 2008, p. 8; Stvilia et al. 2005, p. 9).
Stvilia et al., for example, found that featured articles, i.e. a selection of the best articles de-
termined by Wikipedia’s editors (Wikipedia 2009e), show higher Flesch readability scores
than articles in a random set (2005, p. 7). The Flesch readability formula is a very popular
function of the number of words per sentence and the number of syllables per word used in a
text, which yields a number between 0 (very difficult) and 100 (very easy) to assess its readability
(den Besten, Loubser & Dalle 2008, p. 10). While very easy to compute, it allows assessing
the readability of texts with considerable accuracy (p. 11).

Number of categories: The categorization of articles, which is based on a special form of
social tagging, plays an important role in structuring the content of Wikipedia (Voss 2005, p.
10). According to the editing guidelines every article should not only be placed in at least one
category but in all categories to which it logically belongs (Wikipedia 2009d). Thus, the num-
ber of categories an article is placed into can be used as a proxy for measuring how well the
output is structured as well as the level of integration within the encyclopedia.
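
A minimal Python sketch of such a count is given below. It assumes that category membership appears in the article markup as `[[Category:Name]]` tags, optionally with a sort key after a pipe; the pattern and the function name are hypothetical and only illustrate the idea.

```python
import re

# Assumed form of a category tag: [[Category:Name]] or [[Category:Name|sort key]]
CATEGORY_TAG = re.compile(r"\[\[\s*Category\s*:\s*([^\]|]+)", re.IGNORECASE)


def count_categories(wikitext):
    """Count the distinct categories an article is placed into."""
    names = {name.strip() for name in CATEGORY_TAG.findall(wikitext)}
    return len(names)


article = (
    "Vienna is a city.\n"
    "[[Category:Cities in Austria]]\n"
    "[[Category:Capitals in Europe|Vienna]]\n"
    "[[category:Cities in Austria]]\n"
)
n_categories = count_categories(article)
print(n_categories)
```

Duplicate tags are counted once, and links that merely point to a category page (written with a leading colon, as in `[[:Category:Name]]`) are not matched.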

2.3 Links between Community Characteristics and Performance

While several researchers have examined the relationships between particular characteristics
of online communities and their performance, this thesis aims to combine these attempts into
a holistic framework to analyze online communities. To this end, in the following sections
characteristics of online communities are examined from different perspectives (see figure 1).

Figure 1: Characteristics of online communities (original illustration)

The community-centered perspective, which deals with the effects of size and heterogeneity
of the communities on the created output, is followed by the application of a more user spe-
cific view based on the activity and focus of the average community member. Last but not
least, the characteristics and effects of the users’ collaboration are reviewed in terms of inter-
activity and dynamics.

2.3.1 Community-Centered Perspective

With an increasing number of members communities generally have access to more resources
and better information (Butler 2001, p. 348). However, the size of a community is not the
only factor influencing its capability to perform well. Another important factor often dis-
cussed in the literature is a group’s composition and hence the heterogeneity of its members
(Horwitz & Horwitz 2007, p. 988). The community-centered perspective applied in this sec-
tion therefore examines the links between the sizes of the analyzed communities, the hetero-
geneity of their members and the quantity as well as quality of their outputs.

Size: Counting the number of members is the first thing that comes to mind when thinking
about the characteristics of an online community. As there is still a lack of research on the
relatively new field of collaborative content creation platforms like Wikipedia, findings in the
free and open source software movement can be of help, as these two fields share considera-
bly similar philosophies (Ortega, Gonzalez-Barahona & Robles 2007, p. 47; Stvilia et al.
2005, p. 1). Thus Wikipedia, similarly to other communities, takes advantage of the collective
knowledge of its users (Stvilia et al. 2005, p. 6), thereby obeying Linus’s Law: “Given
enough eyeballs all bugs are shallow” (Raymond 2000). The number of eyeballs, however, is
hard to estimate, as the number of lurkers, i.e. users who read but do not participate, is hard to
grasp (Nonnecke & Preece 1999, p. 123). In their analyses Stvilia et al. focus on people who
bother to make a change, noting that this number is “obviously much smaller and probably
more interesting and maybe correlating with the real number of eyeballs” (2005, p. 6). Strictly
speaking, only people who identify themselves with the community and as a result participate
actively are understood as members of the communities analyzed in this thesis.
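
Under this definition, community size reduces to the number of distinct contributors in an article's revision history. The Python sketch below shows the idea; the record layout with a `contributor` field (a user name or, for anonymous edits, an IP address) is an assumption for illustration and not the data model used in this thesis.

```python
def community_size(revisions):
    """Number of distinct contributors in a revision history."""
    return len({rev["contributor"] for rev in revisions})


# Hypothetical revision history of a single article
history = [
    {"contributor": "Alice", "timestamp": "2007-03-01T10:00:00Z"},
    {"contributor": "192.0.2.17", "timestamp": "2007-03-02T08:30:00Z"},
    {"contributor": "Alice", "timestamp": "2007-03-05T14:45:00Z"},
    {"contributor": "Bob", "timestamp": "2007-04-11T19:20:00Z"},
]
size = community_size(history)
print(size)
```

Lurkers never appear in the revision history, so a count of this kind covers active participants only, in line with the definition above.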

Fernandez-Ramil, Izquierdo-Cortazar & Mens show that the number of unique contributors to
open source projects has a positive impact on a software’s total lines of code (2008, p. 4). A
similar relationship could be found by Ortega, Gonzalez-Barahona & Robles in Wikipedia
between the number of unique authors and an article’s size (2007, p. 52). Consequently, the
following hypothesis can be derived from these findings:

Hypothesis H1A: The quantity of information produced by an online community increases
with the number of participants.

Large groups often come up with better solutions than individual experts (Surowiecki 2005, p.
XVII; Tapscott & Williams 2008, p. 41). Similarly, open source projects benefit from peer
reviews by a large base of users (Senyard & Michlmayr 2004, p. 2). As Butler puts it, “in larger
social structures it is more likely that there is a member who knows the needed information”
(2001, p. 348). In the context of Wikipedia, Lih reasons that “with more editors, there are
more voices and different points of view for a given subject” (2004, p. 8) and Wilkinson &
Huberman (2007, p. 4) point out that there is a strong link between the number of editors,
edits and article quality. While large groups, however, traditionally often failed to take advan-
tage of this fact due to logistical problems and various other adverse effects, today the utiliza-
tion of computer mediated communication systems has the potential to significantly reduce if
not eradicate these problems (Butler 2001, pp. 349-350; Surowiecki 2005, pp. 275-277):

Hypothesis H1B: The quality of articles produced by an online community increases with
the number of participants.

To sum up, even though the performance of large groups traditionally suffered due to prob-
lems inherent in their structure, modern information and communication technology helps to
harness their wealth of resources.

Heterogeneity: As online communities often gather around shared interests (Cosley, Ludford
& Terveen 2003, p. 8) and even carry the risk of “balkanization”, i.e. an ongoing separation
into special interest groups (Van Alystyne & Brynjolfsson 2005, p. 851), it is interesting to analyze
how the heterogeneity of community members influences the outcome of their joint
effort. In a meta-analysis of the effects of team diversity on team performance, Horwitz & Horwitz
found a significant positive influence on both quantity and quality of output (2007, p.
1000). Surowiecki highlights the importance of diversity and notes that it is especially important
in small groups as they are very prone to groupthink (2005, pp. 29, 36). He concludes that
while “homogeneous groups are great at doing what they do well, […] they become progressively
less able to investigate alternatives” (2005, p. 31). Katz and Te’eni note that even
though different perspectives are essential, they can also increase misunderstandings and require
additional effort to “overcome the gap” (2007, p. 262). Ludford et al. found that groups
consisting of dissimilar users contribute more to an online community than groups of similar users
(2004, pp. 5, 7), while Cosley, Ludford & Terveen did not find any evidence that similarity of
members influences the outcome of a task (2003, p. 6). Interestingly enough, neither of them
got the results they expected. While Ludford et al. anticipated that similar groups would con-
tribute more due to the attraction of interacting with similar others (2004, p. 7), Cosley, Lud-
ford & Terveen were looking for positive influences comparable to those experienced by tra-
ditional diverse teams (2003, p. 2). These mixed results highlight the possibility of a more
complex, curvilinear, relationship between the heterogeneity of community members and the
output of the online community. Indeed, several researchers have proposed an inverted U-
shaped curvilinear relationship between heterogeneity and outcome (Horwitz & Horwitz
2007, p. 1008):

Hypothesis H2A: The quantity of information produced by an online community follows an
inverted U-shaped curvilinear relationship with the level of its heterogeneity.

Hypothesis H2B: The quality of articles produced by an online community follows an
inverted U-shaped curvilinear relationship with the level of its heterogeneity.
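
One common way to express an inverted U-shaped relationship in a regression framework is to add a squared term for the predictor. The specification below is a generic sketch and not necessarily the model estimated in chapter 4; the symbols are assumptions introduced here, with Y_i denoting a performance indicator of community i, H_i its (non-negative) heterogeneity and ε_i the error term.

```latex
\[
  Y_i = \beta_0 + \beta_1 H_i + \beta_2 H_i^{2} + \varepsilon_i ,
  \qquad \beta_1 > 0, \quad \beta_2 < 0,
  \qquad H^{*} = -\frac{\beta_1}{2\beta_2}
\]
```

Under these sign conditions, performance rises with heterogeneity up to the turning point H* and declines beyond it. The hypothesized shape is only meaningful if H* lies within the range of heterogeneity values actually observed.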

2.3.2 User-Centered Perspective

This section takes a closer look at the characteristics of the individual members of the ana-
lyzed communities. Several researchers have examined the activity of community members to
measure how engaged they are with the community (Cothrel & Williams 1999, p. 56; Preece
2001, pp. 350-351). This paper not only sheds light on the often analyzed activity and partici-
pation of community members but also examines how focused their effort on the analyzed
community is.

Activity: Several researchers found that a small group of users can be credited with the ma-
jority of contributions in online community efforts (Füller, Jawecki & Mühlbacher 2007, p.
64; Mockus, Fielding & Herbsleb 2000, p. 5; Stewart 2005, p. 829). Similarly, in 2005 Wales
said that in the English Wikipedia a core group of 0.7% (524) of all users is responsible for
50% of all edits, with 2% (1400) contributing 73.4% of all edits (2005a, 22:20). What is
more, even though the number of edits does not say anything about the changed content,
Wales implied that Wikipedia is written by this core group of users (2005c). In contrast,
Swartz claimed in a blog post that most new content is added by outsiders who contribute
rarely (2006). Consequently, Kittur et al. analyzed this issue and not only found that the
proportion of edits by “elite users” of Wikipedia has declined in recent years due to the enormous
growth of a group of low-edit users (2007, pp. 3-4), but also that more than 70% of all words
are now changed by editors with fewer than 10,000 edits. However, they also found that
experienced, more active users tend to add more content (1.81 words for every word removed) than
novice users (0.86 words for every word removed) (2007, p. 6).

Hypothesis H3A: The quantity of information produced by an online community increases
with the activity of its contributors.

Kittur et al. note that even though novice users tend to delete more words than they add, they
may still increase the quality of the output (2007, p. 6). Indeed, Anthony, Smith & William-
son found that when it comes to the quality of contributions low-edit anonymous users
(“Good Samaritans”) play an equally important role as committed registered Wikipedians
(“Zealots”) (2007, p. 15). While the quality of edits by anonymous users decreases with the
number of their overall edits, the quality of contributions by registered users points in the op-
posite direction (p. 16). As this study does not distinguish between registered and anonymous
editors it is expected that these effects cancel each other out and there is no significant rela-
tionship between the quality of articles produced and the activity of users.

Hypothesis H3B: The quality of articles produced by an online community is independent
of the activity of its contributors.

Focus: With increasing complexity and size of open content projects, specialization is getting
more and more important. Indeed, contributors in open source projects are often found to spe-
cialize in specific elements of the developed software to apply their domain specific knowl-
edge (von Krogh, Spaeth & Lakhani 2003, p. 1230). Similarly, editors in Wikipedia often
choose to contribute to articles where they have specific knowledge or personal interest
(Wikipedia 2009m). It is therefore posited that this specialization and focus on specific arti-
cles not only yields an increase in the quantity of the created content but also leads to consid-
erable quality improvements.

Hypothesis H4A: The quantity of information produced by an online community increases
with the focus of its contributors.

Hypothesis H4B: The quality of articles produced by an online community increases with
the focus of its contributors.

2.3.3 Collaboration-Centered Perspective

It has been noted that collaboration and social interaction between users is an important requi-
site for the success of user communities (Füller, Jawecki & Mühlbacher 2007, p. 61) and “not
an issue that can be ignored” (Kollock 1996, 23rd paragraph). Similarly, Tapscott & Williams
point out that the success of Wikipedia “is built on the premise that collaboration among users
will improve content over time, in the way that the open source community steadily improved
Linus Torvalds’s first version of Linux” (2008, p. 71). Due to the comprehensive record of
activities in Wikipedia it is not only possible to analyze the “behaviour of information pro-
ducers” (Almeida, Mozafari & Cho 2007, p. 1) but also its effects. To address the interesting
topic of collaboration, this section deals with the interactions between contributors and the
dynamics of contributions.

Interactivity: The development of ideas and innovations is often not a solitary process but
benefits from the assistance of other community members. Franke & Shah, for example, found
that members of user communities do not innovate in isolation but rather receive crucial
advice and assistance from other members (2003, p. 158). Similarly, von Hippel notes that
“innovation communities also tend to behave in a collaborative manner” (2005, p. 105). Given
these findings, it comes as no real surprise that researchers have pointed out the essential need
for interactivity in online communities (Jones 1997; Schoberth, Preece & Heinzl 2003, p. 3).
What is more, researchers found that a critical mass of users is needed “to initiate a sustain-
able interactive discourse” (Schoberth, Preece & Heinzl 2003, p. 3), probably because the
number of possible interactions considerably increases with the size of the group (Butler
2001, p. 348). In the context of Wikipedia Buriol et al. identified a high number of interac-
tions between editors in articles (2006, p. 4).

The impact of interactivity on the quantity of information in Wikipedia articles may be diluted
by “edit wars”, i.e. “interactions where two people or groups alternate between versions of the
page” which are not restricted to controversial topics (Viégas, Wattenberg & Dave 2004, p.
579). However, the number of edit wars has dropped significantly in the last few years
(Viégas et al. 2007, p. 3). As a result, it is anticipated that the positive impacts highlighted
above prevail and that not only the quantity, but also the quality of the produced articles
increases with the level of interactivity:

Hypothesis H5A: The quantity of information produced by an online community increases
with the level of interactivity.

Hypothesis H5B: The quality of articles produced by an online community increases with
the level of interactivity.

Dynamics: In order to thrive, communities have to be dynamic (Mynatt et al. 1998, p. 128).
This study examines the dynamics within online communities by analyzing the distribution of
contributions over time. Members of the analyzed communities can either contribute occa-
sionally or collaborate intensively on the Wikipedia article within a short period of time.

It is expected that the momentum gained in the latter case has positive effects on both the
quantity and the quality of the output produced by the online communities:

Hypothesis H6A: The quantity of information produced by an online community increases
with the level of dynamics.

Hypothesis H6B: The quality of articles produced by an online community increases with
the level of dynamics.

3 Research Method

This chapter explains the research method applied in this study in greater detail. A short in-
troduction of the research site is followed by sections dealing with the study design and the
data collection process. The chapter concludes with information on which measures were used
to operationalize each variable.

3.1 Research Site

Wikipedia, “the free encyclopedia that anyone can edit” (Wikipedia 2009l), is one of the most
successful examples of massive collaborative content development (Ortega, Gonzalez-
Barahona & Robles 2008, p. 304) and the largest encyclopedia in the world (Tapscott & Wil-
liams 2008, p. 71). It applies the “wiki”-concept, invented by Cunningham, to allow users to
easily edit articles, while saving all changes and revisions in its database (Holloway, Bozice-
vic & Börner 2007, p. 30). This history of each page provides a “design trace” of how the
article evolved (Garud, Jain & Tuertscher 2008, p. 361) and provides valuable information on
the editor, the time of the edit, and the changes committed (see figure 2 for an example).

Figure 2: Excerpt of the revision history of a Wikipedia article (Wikipedia 2009h)

Due to these comprehensive records of participation and the availability of complete data-
base-dumps (Wikimedia 2009) Wikipedia is a unique source of data (den Besten, Loubser &
Dalle 2008, p. 8).

3.2 Study Design

As the aim of this study is to analyze the relationship between characteristics of online com-
munities and the quantity and quality of output they create, this paper utilizes Wikipedia as a
natural experiment to analyze a large number of communities with diverse characteristics.
Owing to Wikipedia’s increasing popularity its article base has grown significantly over the
last few years (Viégas et al. 2007, p. 5) and consequently complete dumps of the English
Wikipedia have not only reached enormous file sizes that make them hard to analyze (den
Besten, Loubser & Dalle 2008, p. 8) but have even failed or have been corrupted recently
(Wikimedia 2008b, 2008c). Due to these disturbances and its more manageable size this paper
focuses on the German-language Wikipedia, which is, following the English version, the sec-
ond biggest of all language editions (Wikimedia 2008d). As a matter of fact, however, given
enough computing time all the analyses conducted can be easily performed on the English
version of Wikipedia as well as on an even larger sample.

To minimize biases due to changes in Wikipedia’s popularity and user base only revisions of
articles created in 2007 were analyzed. What is more, all articles edited by only one user were
excluded as they do not qualify as a community effort. From the more than 160,000 remaining
articles, redirects to other articles were removed and a random sample of 5000 articles was
drawn.

The relationships between community-centered, user-centered and collaboration-centered
characteristics on the one hand and the output dimensions Information-Quantity and
Article-Quality on the other hand, as discussed in chapter 2, are examined in this natural experiment
based on the last version of each article in 2007. Additionally, the effect of the age of the ana-
lyzed articles (time passed since its first edit) was controlled for to rule out the possible alter-
native explanation that older communities had more time to create extensive and high quality
content. Figure 3 provides an overview of the assumed relationships that are scientifically
tested in this study.

Figure 3: Research model (original illustration)

3.3 Data Collection and Cleansing

Complete database dumps of Wikipedia and its sister projects are provided online by the
Wikimedia Foundation Inc. (Wikimedia 2009). Even though dumps including all pages with
complete revision history are available, given the huge amount of data the “stub-meta-
history.xml.gz” dump was used, which does not include any page text, but complete revision
metadata. The dump from June 7th of the German Wikipedia (Wikimedia 2008a) was
downloaded in August 2008 and imported into a MySQL database using the MWDumper-tool
(MediaWiki.org 2009).

Almeida et al. mention that Wikipedia dumps are often incomplete due to errors that occurred
during their generation (2007, p. 2). Similar problems were found in the dump analyzed in this
paper, where the table containing all pages was out of sync with the pages included in the
table storing all revisions. Consequently, distinct pages in the revision table were used as a
basis and, where necessary, missing values were queried from the Wikipedia API (Wikipedia
2009g).

The Wikipedia database-dump consists of more than two million pages. However, as the
Wikipedia database is not only comprised of Wikipedia articles but includes various other
pages as well (e.g. talk pages, user pages, image pages etc.; see Wikipedia 2009p), all pages
not belonging to the main namespace, which consists of all articles ever written, were re-
moved from the database. What is more, redirects to other articles were removed as well. A
random sample of 5000 articles created in 2007 with more than one user was then drawn from
the remaining database with standard SQL-statements (see figure 4 for an overview of the
sampling process).
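
The sampling criteria described above can be sketched as a single SQL query followed by a random draw. This is only a minimal illustration and not the statements actually used in this study: the table and column names (page, revision, namespace, is_redirect, editor, timestamp) are hypothetical, and SQLite stands in for the MySQL database so that the example is self-contained.

```python
import random
import sqlite3


def sample_articles(conn, sample_size, seed=0):
    """Draw a random sample of page ids that meet the sampling criteria.

    Assumed (hypothetical) schema:
      page(page_id, namespace, is_redirect)
      revision(page_id, editor, timestamp)  -- ISO-formatted timestamps
    """
    rows = conn.execute(
        """
        SELECT p.page_id
        FROM page AS p
        JOIN revision AS r ON r.page_id = p.page_id
        WHERE p.namespace = 0                 -- main namespace only
          AND p.is_redirect = 0               -- redirects are removed
          AND r.timestamp < '2008-01-01'      -- ignore revisions after 2007
        GROUP BY p.page_id
        HAVING MIN(r.timestamp) >= '2007-01-01'   -- first edit in 2007
           AND COUNT(DISTINCT r.editor) > 1       -- more than one user
        ORDER BY p.page_id
        """
    ).fetchall()
    candidates = [row[0] for row in rows]
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return sorted(rng.sample(candidates, min(sample_size, len(candidates))))
```

Calling sample_articles(conn, 5000) on such a database would return up to 5000 page ids; the fixed seed is an illustrative choice and not part of the original procedure.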

Figure 4: Sampling process (original illustration)

To examine the output of each community in greater detail, the last revision of each article in
the year 2007 was downloaded using a Python script (Gude 2008), which was adapted to the
German version of Wikipedia. The resulting XML files include, among other things, information
about the article and the author, time and text (including wiki markup; see Wikipedia 2009v)
of each revision (examples can be found in the appendix).

To calculate the readability of an article, a plain text version of the article content was
required. Rather than removing the wikitext markup directly, it turned out to be easier to
scrape the content from the Wikipedia website. If an article is accessed online, the wikitext
markup is parsed into formatted HTML text. What is more, the content of an article is preceded
and followed by specific HTML comments (<!--start content -->, <!--end content -->) in the
webpage’s source code. Hence, with the help of regular expressions an article’s content can
easily be extracted and the HTML markup removed to yield the plain text version needed.
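
The extraction step can be sketched as follows. The regular expressions are assumptions made for illustration; the exact patterns used by the parser in this study are not reproduced here, and a simple tag-stripping expression of this kind is only a rough approximation of proper HTML parsing.

```python
import html
import re

# Content is assumed to sit between the two HTML comments named in the text.
CONTENT_RE = re.compile(
    r"<!--\s*start content\s*-->(.*?)<!--\s*end content\s*-->", re.DOTALL
)
# Naive pattern for any HTML tag (an illustrative simplification).
TAG_RE = re.compile(r"<[^>]+>")


def extract_plain_text(page_source):
    """Return the article content as plain text, or None if no markers exist."""
    match = CONTENT_RE.search(page_source)
    if match is None:
        return None
    without_tags = TAG_RE.sub(" ", match.group(1))  # replace tags by spaces
    unescaped = html.unescape(without_tags)         # e.g. &auml; becomes ä
    return " ".join(unescaped.split())              # collapse white space
```

Replacing tags by spaces rather than deleting them keeps words in adjacent elements apart, at the cost of occasional stray spaces before punctuation.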

Due to the large number of articles under study, a parser was developed in the Python
programming language to automatically obtain and analyze the files discussed above (an in-depth
explanation of this method can be found in the appendix). In this process edits by bots, i.e.
“automated or semi-automated tools that carry out repetitive and mundane tasks” (Wikipedia
2009c), were determined on the basis of a recent user-group assignment list (Wikipedia
2009b) and omitted in the analyses. It is important to note, however, that these assignments
are not static and may have changed since 2007, resulting in bots not recognized correctly by
the parser.

Vandalism is another topic which needs to be addressed in this context. Due to the low entry
barrier Wikipedia is quite vulnerable to vandalism. However, due to the fact that all revisions
are stored in the database, malicious edits can be fixed easily and fast. Indeed Wikipedians do
a very good job as flawed articles are often amended within minutes (Viégas, Wattenberg &
Dave 2004, p. 579). Vandalism can occur in various forms and is often hard to detect
automatically as there is no crystal clear definition of vandalism in Wikipedia (pp. 578-579). To
reduce the number of false positives, only two largely unambiguous cases were marked as
vandalism in this analysis:

• Mass deletion of all content on a page

• More than 90% of content was deleted, the remaining text has less than 500 characters
  and no meaningful comment was created (Wikipedia 2009w)

However, the majority of articles examined by hand showed low levels of vandalism, proba-
bly because vandals tend to “specialize” on very popular articles (Wikipedia 2009u) which are
often older than the articles analyzed in this study. Edits identified by the above explained
measures were omitted and vandals were excluded from further analysis of the characteristics
of the specific community.
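
The two vandalism rules listed above can be expressed as a small predicate. This is a sketch under stated assumptions: content is measured in characters, and an empty edit comment is treated as “no meaningful comment”, which is a simplification of whatever criterion was actually applied.

```python
def is_vandalism(old_text, new_text, comment):
    """Flag an edit as vandalism according to the two rules in the text.

    Assumptions (not taken from the thesis): deleted content is measured in
    characters and an empty comment counts as "no meaningful comment".
    """
    if not old_text:
        return False  # nothing existed that could have been deleted
    # Rule 1: mass deletion of all content on a page.
    if not new_text.strip():
        return True
    # Rule 2: more than 90% deleted, fewer than 500 characters remain,
    # and the editor left no comment.
    deleted_share = 1 - len(new_text) / len(old_text)
    return deleted_share > 0.9 and len(new_text) < 500 and not comment.strip()
```

Edits flagged in this way would then be skipped, together with their authors, before the community characteristics are computed.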

3.4 Measures

This section deals with the way the discussed factors of performance and community charac-
teristics were operationalized and explains the applied metrics.

3.4.1 Operationalization of Performance Indicators

As already discussed in chapter 2.2, the constructs Information-Quantity and Article-Quality
consist of several variables.

3.4.1.1 Information-Quantity

The following three variables were standardized and averaged to build the
Information-Quantity construct:

Number of words: To quantify the length of an article, the words included in the article were
counted. For this purpose, the markup was stripped from the HTML version of each article to
yield a plain text version. This text was split into individual words at every white-space
character with the help of regular expressions.

Vocabulary: The number of unique words was calculated accordingly. For simplicity, no
stemming was conducted and stop words were not removed.

Number of links: Regular expressions were used to determine outgoing links on every ana-
lyzed article page. Duplicate links and page internal links were omitted.
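
The three counts can be illustrated with a few lines of Python. The link pattern below is an assumption for illustration only (it looks at href attributes of anchor tags and treats targets starting with “#” as page internal); the regular expressions actually used in this study may differ.

```python
import re

# Hypothetical pattern: the href value of every anchor tag.
LINK_RE = re.compile(r'<a\s[^>]*href="([^"]+)"', re.IGNORECASE)


def word_counts(plain_text):
    """Return (number of words, number of unique words) of a plain text.

    Words are split at white space; no stemming, no stop word removal.
    """
    words = [w for w in re.split(r"\s+", plain_text.strip()) if w]
    return len(words), len(set(words))


def count_unique_links(content_html):
    """Count distinct outgoing links, skipping page-internal anchors."""
    targets = set()
    for href in LINK_RE.findall(content_html):
        if href.startswith("#"):  # page-internal link, omitted
            continue
        targets.add(href)         # the set removes duplicate links
    return len(targets)
```

Because words are compared as written, “Hund” and “Hunde” count as two vocabulary entries, which mirrors the decision not to apply stemming.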

3.4.1.2 Article-Quality

The Article-Quality construct was created by standardizing and averaging the following two
variables:

Readability: The Flesch reading ease is a function of the average sentence length (ASL;
words per sentence) and the average number of syllables per word (ASW) (den Besten, Loub-
ser & Dalle 2008, p. 10):

FRE English = 206.835 − 1.015 ⋅ ASL − 84.6 ⋅ ASW

This formula yields a number between 0 (very difficult) and 100 (very easy), with standard
English texts usually scoring a number between 60 and 70 (den Besten, Loubser & Dalle
2008, p. 11). Flesch readability scores were calculated for the plain text versions of each arti-
cle’s last revision in 2007 using an online tool (stilversprechend.de 2009a) which applies an
adapted version of the formula for the German language (stilversprechend.de 2009b):

FRE German = 180 − ASL − 58.5 ⋅ ASW
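
Both formulas translate directly into code. The sketch below takes ASL and ASW as given inputs, since counting sentences and syllables (which the online tool handles) is outside its scope; the input values in the usage note are made up for illustration.

```python
def flesch_english(asl, asw):
    """Flesch reading ease for English text.

    asl: average sentence length in words per sentence
    asw: average number of syllables per word
    """
    return 206.835 - 1.015 * asl - 84.6 * asw


def flesch_german(asl, asw):
    """Adapted Flesch reading ease for German text (same inputs)."""
    return 180.0 - asl - 58.5 * asw
```

For a hypothetical German text with 15 words per sentence and 1.8 syllables per word, flesch_german(15, 1.8) gives 180 − 15 − 105.3 = 59.7.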

Number of categories: In Wikipedia an article can be placed into a category by adding a
specific category tag (“[[Kategorie:Category name]]” in the German language version, Wikipedia
2009f) to the page. Occurrences of these tags were counted in the last revision of 2007 to cal-
culate the number of categories for each analyzed article.
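
Both performance constructs are built by standardizing their component variables and averaging the results per article. A minimal sketch of this step is given below; it assumes z-standardization with the population standard deviation, which is an illustrative choice since the exact variant is not specified here.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize a list of values to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def composite(*variables):
    """Average the standardized variables article by article.

    Each argument is a list with one value per article, in the same order.
    """
    standardized = [z_scores(v) for v in variables]
    return [mean(per_article) for per_article in zip(*standardized)]
```

For Article-Quality, composite(readability, categories) would be called with the two lists of per-article values; for Information-Quantity, with the three lists of words, vocabulary and links.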

3.4.2 Operationalization of Community Characteristics

In the following paragraphs the operationalization of the community-centered, user-centered and
collaboration-centered variables is outlined.

3.4.2.1 Community-Centered Perspective

Size: Users who want to contribute to an article in Wikipedia have two options: they can ei-
ther sign up to Wikipedia or choose to remain anonymous. Whereas in the former case their
username is associated with their revisions, their IP address is stored in the latter case. It has
been argued that anonymous users play an important role in the creation of content (Anthony,
Smith & Williamson 2007, p. 15; Kittur et al. 2007, p. 6; Viégas, Wattenberg & Dave 2004,
p. 580). Consequently, this paper examines the number of distinct users regardless of whether
they sign up/in or choose to stay anonymous, even though Stvilia et al. point out that this
number can only be an approximation of the actual number of distinct editors as e.g. an indi-
vidual may make edits with more than one username (2005, p. 5).

Heterogeneity: On average, users in the analyzed sample have edited 265 articles in 2007.
Consequently, the 265 most important articles (i.e. articles most members of the community
contributed to during the year 2007) were queried from the created tables with standard SQL
query statements when analyzing the heterogeneity of the members of a community. In the
next step a vector was created for each community user depicting the editing patterns in those
articles he/she co-authored (number of edits) and the articles he/she did not edit (“0”). These
vectors of edited articles can be understood as areas of “common interest” (Korfiatis, Poulos
& Bokos 2006, p. 256), “interest profiles” (Cosley, Ludford & Terveen 2003, p. 2) or
“knowledge profiles” (Van Alstyne & Brynjolfsson 2005, p. 854). The similarities of each
user to every other user were then computed by calculating the cosine of each knowledge
profile pair (Manning & Schütze 2003, p. 300):

cos(x, y) = (x ⋅ y) / (|x| ⋅ |y|)

This measure, often called cosine similarity, is the cosine of the angle between two vectors and
has already been used in other studies when analyzing the similarity of community members (for
example in: Cosley, Ludford & Terveen 2003, p. 4; Van Alstyne & Brynjolfsson 2005, p.
854). Similar to Van Alstyne & Brynjolfsson’s approach, groups of users are compared in
this paper by the average similarity of their profiles (2005, p. 854). The heterogeneity was
then calculated by subtracting a community’s average similarity (a number between 0 and 1)
from 1:

Heterogeneity = 1− Average Similarity
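
The heterogeneity measure can be sketched in a few lines. The code assumes that every user profile is a list of edit counts over the same ordered set of articles and that a profile without any edits has a similarity of 0 to all others; the latter is an assumption made to avoid division by zero.

```python
from itertools import combinations
from math import sqrt


def cosine_similarity(x, y):
    """Cosine of the angle between two edit-count vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: an empty profile is similar to nobody
    return dot / (norm_x * norm_y)


def heterogeneity(profiles):
    """1 minus the average pairwise similarity of at least two profiles."""
    pairs = list(combinations(profiles, 2))
    average_similarity = sum(cosine_similarity(x, y) for x, y in pairs) / len(pairs)
    return 1 - average_similarity
```

Two users who edited entirely different articles, e.g. [1, 0] and [0, 1], yield a heterogeneity of 1, whereas two users with proportional editing patterns, e.g. [2, 4] and [1, 2], yield a value of approximately 0.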

3.4.2.2 User-Centered Perspective

Activity: To measure the general activity of community members, the number of edits in all
articles in Wikipedia in the year 2007 per user was queried from the database and averaged
for each community. Let us suppose a community consists of two contributors. A made 100
edits in Wikipedia articles in 2007, whereas B contributed 200 times. The activity of users in
this community hence amounts to 150 edits.

Focus: To assess the level of commitment in each community, the average proportion of its
members’ activity in the analyzed community was calculated. To proceed with the previous
example: A and B each made 10 edits in the analyzed article. The focus of users in this community
hence amounts to 0.075, the mean of A’s share (10/100 = 0.1) and B’s share (10/200 = 0.05).
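
The worked example for activity and focus can be reproduced with the following sketch. The dictionary-based inputs are an illustrative choice; in this study the underlying counts were queried from the database.

```python
def activity(total_edits):
    """Average number of Wikipedia-wide edits per community member.

    total_edits maps each member to his/her edits in all articles in 2007.
    """
    return sum(total_edits.values()) / len(total_edits)


def focus(article_edits, total_edits):
    """Average share of each member's edits made in the analyzed article.

    article_edits maps each member to his/her edits in the analyzed article.
    """
    shares = [article_edits[user] / total_edits[user] for user in article_edits]
    return sum(shares) / len(shares)
```

With total_edits = {"A": 100, "B": 200} and article_edits = {"A": 10, "B": 10}, activity returns 150 and focus returns 0.075, as in the example above.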

3.4.2.3 Collaboration-Centered Perspective

Interactivity: All edits in 2007 of each article in the sample were analyzed in this paper. In a
first step the number of interactive edits was counted, i.e. the first edit and all edits that were
preceded by an edit of another community member. The level of interactivity was then
calculated as the ratio of the number of interactive edits minus the number of distinct authors
to the total number of edits:

Interactivity = (Interactive Edits − Users) / Total Edits

Let us suppose that four users (A, B, C, D) created an article and the revision history reveals
the following eight edits: A B C D A B C D. The interactivity level of this example amounts
to 0.5 as all edits by those four contributors are interactive edits ([8-4]/8).
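
The interactivity measure can be sketched as a single pass over the chronologically ordered list of editors. The function name and the list-based input are illustrative assumptions.

```python
def interactivity(editors):
    """Interactivity of an article given its editors in chronological order.

    An edit is interactive if it is the first edit or if the preceding edit
    was made by a different community member.
    """
    interactive_edits = 0
    previous = None
    for position, editor in enumerate(editors):
        if position == 0 or editor != previous:
            interactive_edits += 1
        previous = editor
    distinct_users = len(set(editors))
    return (interactive_edits - distinct_users) / len(editors)
```

For the revision history A B C D A B C D the function returns (8 − 4) / 8 = 0.5. A history such as A A B B, in which each user edits only in one uninterrupted block, yields 0.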

Dynamics: To assess the dynamics within communities, the median time between edits was
calculated for each article. This metric makes it possible to analyze whether community members
intensively edited the article in a short period of time or whether their efforts were distributed
over the whole year 2007. As less dynamic communities exhibit higher median times between edits,
the resulting figure was multiplied by -1 to ease interpretation.
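
A minimal sketch of the dynamics measure is given below. Expressing the result in days is an assumption made for readability; since predictors are standardized before entering the regressions, the unit does not affect the standardized coefficients.

```python
from datetime import datetime
from statistics import median


def dynamics(timestamps):
    """Negative median time between consecutive edits, in days.

    timestamps: datetime objects of an article's edits (at least two).
    """
    ordered = sorted(timestamps)
    gaps = [
        (later - earlier).total_seconds()
        for earlier, later in zip(ordered, ordered[1:])
    ]
    return -median(gaps) / 86400  # 86400 seconds per day
```

For edits on 1, 2, 5 and 6 January 2007 the gaps are 1, 3 and 1 days, the median is 1 day and the dynamics value is therefore -1.0.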

4 Results

The following chapter presents the results of this study in two sections. While the first sec-
tion, Descriptive Statistics, provides an in-depth descriptive analysis of the articles and com-
munities in the sample, the second section, Inferential Statistics, contains the results of two
ordinary least squares (OLS) regressions used to statistically test the developed hypotheses on
the relationships between characteristics of online communities and their performance.

4.1 Descriptive Statistics

The random sample drawn from the Wikipedia database consists of 5000 articles that were
created in 2007 and edited by more than one user. Due to these sampling criteria it is no sur-
prise that the average article age is slightly skewed towards older articles that had more time
to attract enough contributors and amounts to 195.81 days (standard deviation: 105.55) with a
minimum of 0.43 days and a maximum of 364.94 days. Furthermore, due to the fact that
Wikipedia is the largest encyclopedia in the world (Tapscott & Williams 2008, p. 71) and is
still growing (Wikipedia 2009i), it is plausible that articles created in 2007, as evident from
the sample, often cover very specific, niche topics or recent events.

Community-centered perspective: Online communities in the sample generally show a
moderate number of contributors with an average of 5.57 users (s.d.: 5.67). As already
predetermined by the sample criteria, the minimum number of users found in the sample is 2.
However, there are also a number of outliers. The most extreme outlier is the community that
collaborated on the article about Knut, the famous polar bear born in the Berlin Zoo, with 230
contributors. Concerning the heterogeneity of community members, a considerable diversity
was found. The average level of heterogeneity amounts to 0.8252 (s.d.: 0.1272) with a mini-
mum of 0.00 and a maximum of 0.9998.

User-centered perspective: On average, users in a community show an activity of 5135.79
contributions (s.d.: 4305.81). The minimum activity amounts to 2 edits, while the maximum is
38029. Furthermore, community members tend to work on more than one article, as the average
focus only amounts to 9.04 percent (s.d.: 13.26) with a minimum of 0.00 percent and a
maximum of 91.34 percent.

Collaboration-centered perspective: The average interactivity level in the sampled
communities amounts to 0.0908 (s.d.: 0.1079) with a minimum of 0 and a maximum of 0.6. The
median time between edits averages 8.77 days (s.d.: 24.58) with a minimum of 21 seconds
and a maximum of 283.93 days.

Information-Quantity: The average article in the sample consists of 382.77 words (s.d.:
543.64). There is, however, also an article without any words in the sample, as its last revi-
sion of the year 2007 did not contain any content. The longest article deals with an in-depth
description of the course of the NHL season 2007 (11784 words). What is more, 220.06
unique words (s.d.: 216.78) are used on average in each article. While the minimum is again
0, stemming from the empty article discussed above, the article with the most unique words
contains a table on Chinese Unicode characters (3738 words). The average number of unique
links amounts to 33.2 (s.d.: 37.23) per article. The minimum number of links is once more 0
due to the empty article, while the article with the highest number of unique links lists all fe-
male Olympic medalists in athletics (789 links).

Article-Quality: The average Flesch readability score of articles in the sample amounts to
56.97 (s.d.: 11.42) depicting a reasonable readability (stilversprechend.de 2009b). For six arti-
cles, however, no valid readability values could be determined due to insufficient length of
the articles’ content. What is more, looking at the most extreme outliers (min: 5; max: 100)
reveals that articles consisting of mere tables and lists cannot be assessed well with the help of
the Flesch readability function. On average articles in the sample are placed into 3.05 catego-
ries (s.d.: 2.15). While there are several articles in the sample which are not part of any cate-
gory, an article dealing with the achievements of a German silviculture scientist shows the
highest number of categories (18).

4.2 Inferential Statistics

The hypotheses developed in chapter 2 were statistically tested with the help of two OLS-
regressions. While the first regression deals with the influences of community characteristics
on the quantity of information produced, the second regression examines their effects on the
articles’ quality. Table 1 summarizes the test results derived from these two regressions:

                                          Dependent Variables

                          Hypotheses   Information-Quantity¹   Hypotheses   Article-Quality¹

Community-Centered
  # Users                 H1A(+)        0.290***               H1B(+)        0.080***
  Heterogeneity           H2A(∩)       -0.090***               H2B(∩)       -0.077***
  Heterogeneity²          H2A(∩)       -0.067**                H2B(∩)       -0.083***

User-Centered
  Activity                H3A(+)       -0.027†                 H3B(0)        0.014
  Focus                   H4A(+)       -0.054***               H4B(+)       -0.102***

Collaboration-Centered
  Interactivity           H5A(+)        0.089***               H5B(+)        0.020
  Dynamics                H6A(+)        0.021                  H6B(+)        0.076***

Article age                            -0.059***                            -0.008

R² (R² adjusted)                        0.105 (0.103)                        0.025 (0.023)
F-Value                                 74.175                               15.967
p-Value                                 0.000                                0.000

† p < .10 (two-tailed test), * p < .05 (two-tailed test), ** p < .01 (two-tailed test), *** p < .001 (two-tailed test);
articles: n=5000; ¹ values are standardized coefficients (β-values); predictors were standardized before entry

Table 1: Results of the conducted OLS-regressions

4.2.1 Results related to Information-Quantity

The fit indices for the OLS-regression on Information-Quantity indicate a good fit of the
model with an adjusted R-square of 0.103 (F-Value: 74.175; p-Value: 0.000).

The test of H1A revealed that the quantity of information produced by an online community,
as predicted, increases with the number of contributors (β= 0.290***).

Furthermore, the coefficient for the quadratic term of heterogeneity shows the expected nega-
tive sign, which is indicative of an inverted U-shaped relationship between Information-
Quantity and heterogeneity (Aiken, West & Reno 1991, p. 65). Using differential calculus, the
maximum point of the inverted U can easily be calculated (Aiken, West & Reno 1991, p. 65;
Eisinga, Scheepers & van Snippenburg 1991, p. 113) and is located at a heterogeneity level of
0.58. In order to be able to compare the effect of heterogeneity with the standardized regres-
sion coefficients of other predictors, the method outlined in Eisinga, Scheepers & van Snip-
penburg (1991, p. 109) was used to obtain a composite effect of the linear and quadratic term.
The standardized regression coefficient of this combined effect amounts to 0.064 and is
highly significant (p<0.001). Note that even though the sign of this coefficient is a technical
artifice (Eisinga, Scheepers & van Snippenburg 1991, p. 110) and is hence not related to the
sign of the relationship between independent and dependent variable, its size makes it possible
to investigate the relative importance of heterogeneity for the explanation of the dependent
variable Information-Quantity. These results support H2A.
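
The differential-calculus step mentioned above follows from setting the first derivative of y = b0 + b1·x + b2·x² to zero, which gives the turning point x* = −b1 / (2·b2); it is a maximum if b2 is negative. The sketch below illustrates this with made-up coefficients. It applies to coefficients expressed on the scale of the predictor as entered into the model and is not meant to reproduce the values reported in this study, which may require a transformation back to the original scale.

```python
def turning_point(b_linear, b_quadratic):
    """Turning point of y = b0 + b_linear * x + b_quadratic * x**2."""
    if b_quadratic == 0:
        raise ValueError("no turning point without a quadratic term")
    return -b_linear / (2 * b_quadratic)


def is_inverted_u(b_quadratic):
    """A negative quadratic coefficient indicates an inverted U shape."""
    return b_quadratic < 0
```

With the hypothetical coefficients b_linear = 1.2 and b_quadratic = -1.0 the curve is an inverted U with its maximum at x = 0.6.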

Even though positive impacts of the activity and focus of community members on the
quantity of output were predicted, the analysis revealed negative relationships (activity:
β= -0.027†; focus: β= -0.054***). Consequently, H3A and H4A had to be rejected.

The analysis provides support for H5A, as the level of interactivity within the examined
communities has a highly significant positive influence on the information quantity (β=
0.089***).

H6A, however, which predicted a positive influence of collaboration dynamics, did not find
empirical support as the relationship was not significant (β= 0.021; p= 0.139).

The control variable article age exerts a negative influence on the quantity of information (β=
-0.059***).

4.2.2 Results related to Article-Quality

The indicators of how well the model fits the data point out a moderate fit with an adjusted R-
square of 0.023 (F-Value: 15.967; p-Value: 0.000).

The analysis reveals that the quality of the articles produced, as predicted, increases with the
number of contributors (β= 0.080***). Hence, H1B was supported by the data.

Again, the coefficient for the quadratic term of heterogeneity shows the expected negative
sign, suggesting an inverted U-shaped relationship between Article-Quality and heterogeneity.
The maximum point of the inverted U is located at a heterogeneity level of 0.66. What is
more, the standardized regression coefficient of the combined effect amounts to 0.062 and is
highly significant (p<0.001). These results support H2B.

As predicted in H3B, no evidence of a relation between the activity of community members
and the quality of the output could be found (β= 0.014; p= 0.375).

H4B, which predicted a positive influence of focus of community members, had to be rejected
as the impact turned out to be negative (β= -0.102***).

H5B, positing a positive influence of interactivity, did not find empirical support in the data.
Even though the sign is as expected, the effect is not significant (β= 0.020; p= 0.153).

The expected positive influence of the collaboration dynamic (H6B) found support in the data
(β= 0.076***).

Article age was controlled for but showed no significant impact on the quality of produced
articles (β= -0.008; p= 0.601).

5 Discussion and Implications

The aim of this study was to investigate the relationship between the characteristics of online
communities and their performance. Therefore the output and characteristics of 5000 commu-
nities gathering around Wikipedia articles were analyzed from different perspectives. The
results demonstrate that the number of users is by far the most influential force that drives
content creation. When it comes to the quality of the created output, however, characteristics
of community members and how they collaborate are as important as the sheer number of
contributors. The following paragraphs review and discuss these and other findings in greater
detail.

Amongst others, the study revealed that the quantity of information created by an online com-
munity is related to a number of community characteristics. Table 2 summarizes those find-
ings:

Perspective             Hypotheses: The quantity of information produced by an online   Supported
                        community …

Community-Centered      H1A: … increases with the number of participants.               YES
                        H2A: … follows an inverted U-shaped curvilinear relationship    YES
                             with the level of its heterogeneity.

User-Centered           H3A: … increases with the activity of its contributors.         NO
                        H4A: … increases with the focus of its contributors.            NO

Collaboration-Centered  H5A: … increases with the level of interactivity.               YES
                        H6A: … increases with the level of dynamics.                    NO

Table 2: Summary of results for hypotheses H1A-H6A (Information-Quantity)

Both hypotheses regarding community-centered characteristics found support in the data. The
analysis showed that the size of the community has by far the biggest influence among all
factors, with larger communities tending to create more output. Furthermore, the output of
community users, as predicted, follows an inverted U-shaped curvilinear relationship concern-
ing their heterogeneity, probably the result of disagreement over what should or should not be
part of an article.

When it comes to the hypotheses concerning user-centered characteristics, neither the expected
positive influence of activity nor the posited positive influence of focus was supported
by the data, as both influences turned out to be negative. In their analyses Kittur et al. found
that more active and experienced users tend to add more content than novice users (2007, p.
6). They, however, calculated these numbers over the whole time Wikipedia had been in exis-
tence and this reported trend may have shifted over recent years, especially in newly created
articles that, as already discussed above, nowadays often cover very specific, niche topics.
What is more, as the activity of community members in this study was measured as the num-
ber of contributions in 2007, this finding may be diluted by the fact that experienced users
that were very active in previous years and curbed their activity in 2007 were counted as oc-
casional contributors. Regarding the focus of community members on specific articles it was
expected that specialization leads to an increase in the output created. However, it turned out
that online communities benefit if their members are not too focused on a task. It seems as if
not only experts in a field but also novice users can contribute considerably to an open-
content project. Nevertheless, further research is needed to clarify the impact of activity and
focus on the quantity of content created.

What is more, whereas evidence of the positive influence of interactivity on the quantity of
created content was found, the impact of dynamics was positive but not significant. Thus, it
could be shown that the output increases when community members do not work in solitary
conditions but assist each other and collaborate interactively.

If at all, a positive impact of the control variable article age was expected. The negative influ-
ence found, however, points out that even young communities can be very productive. Some
of the articles examined may not only have grown over time, but also might have been short-
ened again. These shrinkages can happen if text is deleted or more dramatically if an article is
split and large sections of it are moved to a more specific page (Viégas, Wattenberg & Dave
2004, p. 580).

The second OLS-regression conducted showed the links between the quality of the output
created by an online community and a number of community characteristics. Table 3 summa-
rizes those findings:

Perspective             Hypotheses: The quality of articles produced by an online       Supported
                        community …

Community-Centered      H1B: … increases with the number of participants.               YES
                        H2B: … follows an inverted U-shaped curvilinear relationship    YES
                             with the level of its heterogeneity.

User-Centered           H3B: … is independent of the activity of its contributors.      YES
                        H4B: … increases with the focus of its contributors.            NO

Collaboration-Centered  H5B: … increases with the level of interactivity.               NO
                        H6B: … increases with the level of dynamics.                    YES

Table 3: Summary of results for hypotheses H1B-H6B (Article-Quality)

The analysis provided support for both hypotheses regarding the impact of community-
centered characteristics. Larger communities tend to create output of higher quality. In con-
trast to the results on Information-Quantity, however, community size is not the most impor-
tant factor. Again, evidence for an inverted U-shaped curvilinear relationship between the
heterogeneity of community members and the quality of their output was found. This finding
highlights the importance of a moderate level of heterogeneity in an online community.

Regarding the user-centered characteristics, as predicted, the average activity had no signifi-
cant influence on the quality of the communities’ output. Furthermore, it became evident that
the posited positive influence of focus is in fact negative and the most important of all factors
influencing an article’s quality. Concerning the influence of activity additional research is
needed to test whether a distinction between user-groups can replicate the findings of An-
thony, Smith & Williamson who found that the quality of edits by anonymous users decreases
with the number of their overall edits while the quality of contributions by registered users
points in the opposite direction (2007, p. 16). Similarly to the findings on
Information-Quantity, the results regarding the influence of focus on the quality of the created
output imply that communities benefit significantly from contributions by users who are not
specialized in an article but work on a variety of topics, or as Wikipedians put it:

“It turns out that in some ways, analytic skills and neutrality often play a greater role
than specialisation; editors who have worked for a time on a variety of articles usually
become quite capable of making good quality editorial decisions regarding specialist
material, even on unfamiliar technical subjects” (Wikipedia 2009m).

However, every article needs some experts that watch for and correct errors (Wikipedia
2009m). In line with these findings, Williams & Cothrel (2000, p. 90) stress the importance of
maintaining a balance between experts and novice users. Nevertheless, further research is
needed to clarify these connections in greater detail.

Finally, the effects of both collaboration-centered characteristics, interactivity and dynamics,
show the expected positive trend. The effect of interactivity, however, is not significant and hence
needs further clarification. The importance of intensive collaboration in online communities is
further stressed by the evident positive impact of its dynamics on the quality of the produced
output.

Following this discussion of the link between characteristics of online communities and their
performance, it is of interest to examine which communities generate both extensive and
high-quality content. Therefore, the significant effects of community characteristics on both
performance indicators, Information-Quantity and Article-Quality, found in this study are
summarized in Table 4.

                          Information-Quantity    Article-Quality

Community-Centered
  # Users                          +                     +
  Heterogeneity                    ∩                     ∩

User-Centered
  Activity                         -
  Focus                            -                     -

Collaboration-Centered
  Interactivity                    +
  Dynamics                                               +

Table 4: Significant effects of community characteristics on performance

Looking at this table it becomes evident that when taking both quantity and quality into ac-
count, those communities perform best that consist of a large number of users with a moderate
level of heterogeneity and a fair share of occasional and novice contributors who operate in a
variety of fields and collaborate interactively and dynamically.

5.1 Implications for Theory

While previous approaches often examined particular aspects of online communities, this
study introduced a framework to combine several of these perspectives to analyze the link
between community characteristics and performance. Furthermore, it extended the current
literature on online communities by utilizing Wikipedia as a massive experiment to analyze
5000 diverse communities and thereby empirically testing and substantiating these relation-
ships.

Even though additional research is needed to further clarify certain findings, it was shown that
not only general characteristics of online communities but also user specific characteristics
play an important role in the creation of extensive and high quality content. What is more, the
results also stress the importance of taking the specific way members of online communities
collaborate into account.

Future research may build on these findings and the approach applied in this study to develop
an even more detailed framework for analyzing the link between characteristics of online
communities and their performance.

5.2 Implications for Methods

In this study a scalable approach was introduced to automatically analyze diverse communities
in online environments. Thanks to the availability of and easy access to its database,
Wikipedia is a unique source of data that proved to be a good research site for natural
experiments and yielded considerable insights into the link between characteristics and
performance of online communities.

The most accurate analysis of communities in Wikipedia could be gained from analyzing all
the available data. However, the databases of all popular language editions have grown to
enormous sizes and hence working with a sample seems to be the best way to proceed. Due to
limited computing resources and the large size of the English Wikipedia database it was de-
cided in this study to analyze a sample of 5000 communities in the German version of
Wikipedia. Given more computing time, however, the efficient and scalable approach applied in
this study allows the analyses to be easily extended to a bigger sample as well as to different
language editions in order to compare and validate results. What is more, boundaries
of analyzed communities can be enlarged to not only examine communities gathering around
individual articles but larger communities e.g. in WikiProjects, which are collections of arti-
cles that deal with specific topics (Wikipedia 2009t).

As this study analyzed the output of communities based on the latest version of the year 2007,
advancing the introduced method to see how the output changes and evolves over time and in
years to come may yield additional insights. This information can be easily extracted from the
collected data by the developed parser.
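
One conceivable way of deriving such a time series is sketched below. The sketch assumes that
the revision history of an article is available as pairs of timestamp and article length; this
data layout and the function name `length_by_year` are assumptions for illustration and do not
describe the actual output format of the parser.

```python
from datetime import datetime

def length_by_year(revisions):
    """Map each year to the article length of the last revision in that year.

    revisions: iterable of (timestamp, length_in_characters) tuples.
    The tuple layout is a hypothetical format chosen for this sketch.
    """
    latest = {}
    for timestamp, length in sorted(revisions):
        # Revisions are visited chronologically, so later ones overwrite earlier ones
        latest[timestamp.year] = length
    return latest

# Made-up revision history of a single article
history = [
    (datetime(2006, 3, 1), 1200),
    (datetime(2006, 11, 20), 3400),
    (datetime(2007, 6, 5), 5100),
    (datetime(2007, 12, 30), 4800),  # article shortened, e.g. after a split
]
snapshots = length_by_year(history)
```

Comparing consecutive yearly snapshots would show both growth and shrinkage of an article, which
a single snapshot at the end of 2007 cannot reveal.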

Even though the data available on Wikipedia and its sister projects (e.g. Wikibooks, Wiktion-
ary, Wikinews etc.) still promise various ways of analyses and research, the parser developed
can also easily be expanded and adapted to analyze other online environments to increase the
generalizability of the results.

5.3 Implications for Practice

The results of this study show that the performance of online communities depends not only on
general community characteristics like size and heterogeneity but also on more user-specific
characteristics such as activity and focus. What is more, how these users collaborate plays an
important role in influencing content quantity and quality. These findings suggest that com-
munity operators can pro-actively influence the performance of online communities by pro-
viding favorable conditions.

As already mentioned before, Wikipedia is one of the most successful examples of mass-
collaboration, most likely due to the favorable conditions provided by its operators and the
software used. While the low entry barriers for contributors, for example, allow novice users to
contribute without going through a lengthy sign-up process often found in other online
environments, the applied “wiki”-concept ensures that they cannot do any real harm. What is
more, several tools like watch lists and revision histories support contributors in collaborating
interactively and dynamically.

Community operators can learn from the presented results and Wikipedia, as a best-practice
example, to apply appropriate strategies and tools in their effort to influence the performance
of communities.

When involving online communities in their innovation process, businesses generally have
two distinct options: they can either try to find and harness an already existing community or
attempt to build their own (Franke 2005, p. 708). Either way, results of this study imply that
they should aim for the following characteristics of online communities to foster the creation
of content which is both extensive and of good quality:

• A large community consisting of moderately heterogeneous users

• Low entry barriers that allow both novice users and experts to collaborate on various
  tasks and topics

• An environment which not only supports but fosters interactive and dynamic collaboration

5.4 Limitations

The methods employed in this study have a number of inherent limitations and involve a
number of assumptions that are challenged and discussed in the following paragraphs.

This study used cross-sectional data to examine the link between several community charac-
teristics and the performance of online communities. Even if most of the developed hypothe-
ses were supported by the data and several meaningful correlations could be found, it could
still be that this study mixed up cause and effect. Longer articles for example may attract
more contributors than shorter articles and not the other way around. Longitudinal analyses
may allow stronger causal claims than the approach applied.

Due to the fact that this analysis was not a controlled experiment in a laboratory setting but
rather a natural experiment, not all variables could be controlled. External influences can
hence not be ruled out and unmeasured variables may have had a significant impact on the
results. Especially the low fit of the OLS-regression on Article-Quality highlights that some
important variables may have been omitted.

To keep the research design concise and easy to understand only the main effects of inde-
pendent variables were analyzed in this study. However, during the analyses it became evi-
dent that there could be significant interactions between several community characteristics
discussed in this paper. Including interactions between these variables in the analyses may
yield a more complex, yet more comprehensive model and increase its fit with the data.
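
How such an interaction could be included is sketched below for two predictors. In line with
common practice for interaction models (cf. Aiken, West & Reno 1991), the predictors are
mean-centered before their product is formed. The variable names, the made-up data and the
helper `fit_with_interaction` are assumptions for illustration only and were not part of the
analyses conducted in this study.

```python
import numpy as np

def fit_with_interaction(x1, x2, y):
    """OLS of y on centered x1, centered x2 and their product.

    Returns coefficients in the order: intercept, x1, x2, x1*x2.
    """
    x1c = x1 - x1.mean()
    x2c = x2 - x2.mean()
    # Design matrix: constant, both main effects and the interaction term
    design = np.column_stack([np.ones_like(x1c), x1c, x2c, x1c * x2c])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs

rng = np.random.default_rng(1)
n_users = rng.normal(size=500)        # stand-in for community size
interactivity = rng.normal(size=500)  # stand-in for interactivity
u = n_users - n_users.mean()
i = interactivity - interactivity.mean()
# Made-up, noise-free outcome with a known interaction effect of 0.5
quantity = 1.0 + 0.8 * u + 0.3 * i + 0.5 * u * i
coefs = fit_with_interaction(n_users, interactivity, quantity)
```

In this toy setting the last coefficient recovers the interaction effect built into the data; a
non-zero value would indicate that the effect of one characteristic depends on the level of the
other.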

Last but not least, the communities analyzed in the context of Wikipedia may have systematic
differences to communities in other online environments. The generalizability of the findings
of this study is therefore limited and results should be interpreted with caution.

5.5 Directions for Future Research

Social scientists often lament how difficult it is to obtain data for research. In the case of
Wikipedia, quite the opposite is true. Even though full dumps of Wikipedia and its sister
projects and hence comprehensive records of collaboration are available, the enormous
amount of data is quite hard to handle. The scalable approach introduced in this study can be
enhanced and applied to several interesting research questions.

Wikipedians have recently started a project to assess every article in Wikipedia (Wikipedia
2009k). While this scheme has not yet been adopted in the German language version
(Wikipedia 2009a) and could hence not be used in this study, future studies can draw upon
this valuable resource to better quantify the performance of online communities.

Furthermore, more and more Wikipedians gather around WikiProjects, collections of articles
that deal with specific topics (Wikipedia 2009t). Analyzing these large communities of inter-
est in combination with the widely used article assessments may yield additional insights.

Regarding the measures used, future research could dig deeper, e.g. by analyzing the access
levels of contributing users (Wikipedia 2009r), the number of barnstars (Wikipedia 2009o)
they have received and the type of comments on their user pages (Wikipedia 2009s) to
describe the characteristics of community users in greater detail.

What is more, each article in Wikipedia has a talk page that is used for editorial coordination
(Wikipedia 2009q). These talk pages are another valuable resource to analyze collaboration
characteristics. Further research may relate discussions on talk pages to the creation of content
in the article to gain valuable insights.

Even though Wikipedia has already been the subject of many studies, the growth of other
Wikimedia projects (Wikipedia 2009n) as well as the availability of extensive data dumps
(Wikimedia 2009), various statistics (Wikipedia 2009j), traffic data (stats.grok.se 2009) and
external quality indicators like Google’s PageRank promise considerable possibilities for
further interesting research.

6 References

Aiken, L.S., West, S.G. & Reno, R.R. 1991, Multiple regression: testing and interpreting
interactions, SAGE Publications Newbury Park, CA.

Almeida, R.B., Mozafari, B. & Cho, J. 2007, 'On the Evolution of Wikipedia', International
Conference on Weblogs and Social Media, Boulder, Colorado, USA,
<http://www.icwsm.org/papers/2--Almeida-Mozafari-Cho.pdf>.

Anthony, D., Smith, S.W. & Williamson, T. 2007, The Quality of Open Source Production:
Zealots and Good Samaritans in the Case of Wikipedia,
<http://www.cs.dartmouth.edu/reports/TR2007-606.pdf>.

Barabási, A.-L. 2003, Linked, Plume, New York.

Bruckman, A. 2006, 'A New Perspective on “Community” and its Implications for Computer-
Mediated Communication Systems', paper presented to the CHI 2006, Montréal, Qué-
bec, Canada, <http://www.cc.gatech.edu/~asb/papers/bruckman-community-
chi06.pdf>.

Buriol, L.S., Castillo, C., Donato, D., Leonardi, S. & Millozzi, S. 2006, 'Temporal Analysis of
the Wikigraph', paper presented to the 2006 IEEE/WIC/ACM International Conference
on Web Intelligence, Hong Kong,
<http://www.inf.ufrgs.br/~buriol/papers/buriol_2006_temporal_analysis_wikigraph.pdf>.

Butler, B.S. 2001, 'Membership Size, Communication Activity, and Sustainability: A Re-
source-Based Model of Online Social Structures', Information Systems Research, vol.
12, no. 4, pp. 346-362.

Cosley, D., Ludford, P. & Terveen, L. 2003, 'Studying the Effect of Similarity in Online
Task-Focused Interactions', 2003 international ACM SIGGROUP conference on Supporting
group work, Sanibel Island, Florida, USA, pp. 321-329,
<http://www.grouplens.org/papers/pdf/simex-group2003.pdf>.

Cothrel, J. & Williams, R.L. 1999, 'On-line communities: helping them form and grow', Jour-
nal of Knowledge Management, vol. 3, no. 1, pp. 54-60.

Cothrel, J.P. 2000, 'Measuring the success of an online community', Strategy & Leadership,
vol. 28, no. 2, pp. 17-21.

den Besten, M., Loubser, M. & Dalle, J.-M. 2008, Wikipedia as a Distributed Problem-
Solving Network,
<http://www.oii.ox.ac.uk/downloads/index.cfm?File=research/dpsn/Wikipedia_full.pdf>.

Eisinga, R., Scheepers, P. & van Snippenburg, L. 1991, 'The standardized effect of a com-
pound of dummy variables or polynomial terms', Quality & Quantity, vol. 25, pp. 103-
114.

Fernandez-Ramil, J., Izquierdo-Cortazar, D. & Mens, T. 2008, 'Relationship between Size,
Effort, Duration and Number of Contributors in Large FLOSS projects', BENEVOL
2008, Eindhoven,
<ftp://ftp.umh.ac.be/pub/ftp_infofs/2008/Benevol2008RamilEtAl.pdf>.

Franke, N. 2005, 'Open Source & Co.: Innovative User-Netzwerke', in S. Albers & O. Gass-
mann (eds), Handbuch Technologie- und Innovationsmanagement, Gabler, Wiesba-
den, pp. 695-712.

Franke, N. & Shah, S. 2003, 'How communities support innovative activities: an exploration
of assistance and sharing among end-users', Research Policy, vol. 32, no. 1, pp. 157-
178.

Füller, J., Jawecki, G. & Mühlbacher, H. 2007, 'Innovation creation by online basketball
communities', Journal of Business Research, vol. 60, no. 1, pp. 60-71.

Füller, J., Matzler, K. & Hoppe, M. 2008, 'Brand Community Members as a Source of Innovation',
Journal of Product Innovation Management, vol. 25, no. 6, pp. 609-619.

Garud, R., Jain, S. & Tuertscher, P. 2008, 'Incomplete by Design and Designing for Incom-
pleteness', Organization Studies, vol. 29, pp. 351-371.

Giles, J. 2005, 'Internet encyclopaedias go head to head', Nature, vol. 438, pp. 900-901.

Holloway, T., Bozicevic, M. & Börner, K. 2007, 'Analyzing and visualizing the semantic
coverage of Wikipedia and its authors', Complexity, vol. 12, no. 3, pp. 30-40.

Horwitz, S.K. & Horwitz, I.B. 2007, 'The Effects of Team Diversity on Team Outcomes: A
Meta-Analytic Review of Team Demography', Journal of Management, vol. 33, no. 6,
pp. 987-1015.

Jones, Q. 1997, 'Virtual-Communities, Virtual Settlements & Cyber-Archaeology: A Theoretical
Outline', Journal of Computer-Mediated Communication, vol. 3, no. 3.

Katz, A. & Te'eni, D. 2007, 'The Contingent Impact of Contextualization on
Computer-Mediated Collaboration', Organization Science, vol. 18, no. 2, pp. 261-279.

Kittur, A., Chi, E., Pendleton, B.A., Suh, B. & Mytkowicz, T. 2007, 'Power of the Few vs.
Wisdom of the Crowd: Wikipedia and the Rise of the Bourgeoisie', CHI 2007, San
Jose, CA, <http://www.parc.com/research/publications/files/5904.pdf>.

Koch, M. 2002, 'Requirements for community support systems - modularization, integration
and ubiquitous user interfaces', Behaviour & Information Technology, vol. 21, no. 5,
pp. 327-332.

Kollock, P. 1996, 'Design Principles for Online Communities', First International Harvard
Conference on the Internet and Society, Boston, USA,
<http://www.sscnet.ucla.edu/soc/faculty/kollock/papers/design.htm>.

Korfiatis, N.T., Poulos, M. & Bokos, G. 2006, 'Evaluating authoritative sources using social
networks: an insight from Wikipedia', Online Information Review, vol. 30, no. 3, pp.
252-262.

Kozinets, R.V. 1999, 'E-Tribalized Marketing?: The Strategic Implications of Virtual Com-
munities of Consumption', European Management Journal, vol. 17, no. 3, pp. 252–
264.

Lakhani, K.R. & Panetta, J.A. 2007, 'The Principles of Distributed Innovation', Innovations,
vol. 2, no. 3, pp. 97-112.

Leimeister, J.M., Sidiras, P. & Krcmar, H. 2006, 'Exploring Success Factors of Virtual Com-
munities: The Perspectives of Members and Operators', Journal of Organizational
Computing and Electronic Commerce, vol. 16, no. 3&4, pp. 277–298.

Licklider, J.C.R. & Taylor, R.W. 1968, 'The Computer as a Communication Device', Science
and Technology, pp. 21-41.

Lih, A. 2004, 'Wikipedia as Participatory Journalism: Reliable Sources? Metrics for evaluat-
ing collaborative media as a news resource', paper presented to the 5th International
Symposium on Online Journalism, University of Texas at Austin, USA, April 16-17,
2004.

Ludford, P.J., Cosley, D., Frankowski, D. & Terveen, L. 2004, 'Think different: increasing
online community participation using uniqueness and group dissimilarity', SIGCHI
conference on Human factors in computing systems, ACM, Vienna, Austria, pp. 631-
638, <http://grouplens.org/papers/pdf/thinkdifferent-chi2004.pdf>.

Manning, C.D. & Schütze, H. 2003, Foundations of Statistical Natural Language Processing,
MIT Press, Cambridge, MA.

Mateos Garcia, J. & Steinmueller, W.E. 2003, 'Applying the open source development model
to knowledge work', INK Open Source Research Working Paper No. 2,
<http://www.sussex.ac.uk/Units/spru/publications/imprint/sewps/sewp94/sewp94.pdf>.

Mockus, A., Fielding, R.T. & Herbsleb, J. 2000, 'A Case Study of Open Source Software
Development: The Apache Server', The 22nd International Conference on Software
Engineering, Limerick, Ireland, <http://mockus.us/papers/apache.pdf>.

Mynatt, E.D., O'Day, V.L., Adler, A. & Ito, M. 1998, 'Network Communities: Something
Old, Something New, Something Borrowed . . .' Computer Supported Cooperative
Work (CSCW), vol. 7, no. 1-2, pp. 123-156.

Nambisan, S. 2002, 'Designing Virtual Customer Environments for New Product Develop-
ment: Toward a Theory', Academy of Management Review, vol. 27, no. 3, pp. 392-
413.

Nonnecke, B. & Preece, J. 1999, 'Shedding Light on Lurkers in Online Communities', Ethno-
graphic Studies in Real and Virtual Environments: Inhabited Information Spaces and
Connected Communities, Edinburgh, pp. 123-128,
<http://www.ifsm.umbc.edu/~preece/paper/16%20Shedding%20Light.final.pdf>.

Ortega, F., Gonzalez-Barahona, J.M. & Robles, G. 2007, 'The Top Ten Wikipedias: A
Quantitative Analysis Using WikiXRay', ICSOFT, Barcelona, Spain, pp. 46-53,
<http://libresoft.es/oldsite/downloads/C4_159_Ortega.pdf>.

Ortega, F., Gonzalez-Barahona, J.M. & Robles, G. 2008, 'On the Inequality of Contributions
to Wikipedia', 41st Annual Hawaii International Conference on System Sciences,
Honolulu, Hawaii, p. 304, <http://libresoft.es/downloads/Ineq_Wikipedia.pdf>.

Preece, J. 2001, 'Sociability and usability in online communities: determining and measuring
success', Behaviour & Information Technology, vol. 20, no. 5, pp. 347-356.

Preece, J. & Maloney-Krichmar, D. 2005, 'Online Communities: Design, Theory, and Prac-
tice', Journal of Computer-Mediated Communication, vol. 10, no. 4, p. article 1.

Preece, J., Maloney-Krichmar, D. & Abras, C. 2003, History and emergence of online com-
munities, Berkshire Publishing Group, Sage,
<http://www.ifsm.umbc.edu/~preece/paper/6%20Final%20Enc%20preece%20et%20al.pdf>.

Rashid, A.M., Ling, K., Tassone, R.D., Resnick, P., Kraut, R. & Riedl, J. 2006, 'Motivating
Participation by Displaying the Value of Contribution', CHI 2006, ACM, Montréal,
Québec, Canada, pp. 955-958,
<http://www.si.umich.edu/~presnick/papers/CHI06/rashidAl.pdf>.

Raymond, E.S. 2000, The Cathedral and the Bazaar (Electronic Version), viewed 07.12.2008,
<http://www.catb.org/~esr/writings/cathedral-bazaar/cathedral-bazaar/ar01s04.html>.

Rheingold, H. 1993, The Virtual Community (Electronic Version), viewed 07.11.2008
<http://www.rheingold.com/vc/book/>.

Ridings, C.M., Gefen, D. & Arinze, B. 2002, 'Some antecedents and effects of trust in virtual
communities', Journal of Strategic Information Systems, vol. 11, pp. 271-295.

Schoberth, T., Preece, J. & Heinzl, A. 2003, 'Online Communities: A Longitudinal Analysis
of Communication Activities', 36th Annual Hawaii International Conference on Sys-
tem Sciences, Big Island, Hawaii,
<http://www.ifsm.umbc.edu/~preece/paper/9%20HICSSNOCD06v2.pdf>.

Senyard, A. & Michlmayr, M. 2004, 'How to Have a Successful Free Software Project', 11th
Asia-Pacific Software Engineering Conference (APSEC’04), Busan, Korea,
<http://kb.cospa-project.org/retrieve/2450/senyardmichlmay.pdf>.

Stewart, D. 2005, 'Social Status in an Open-Source Community', American Sociological
Review, vol. 70, no. 5, pp. 823-842.

Stvilia, B., Twidale, M.B., Smith, L.C. & Gasser, L. 2005, 'Assessing information quality of a
community-based encyclopedia', International Conference on Information Quality,
Cambridge, England, pp. 442-454,
<http://www.isrl.uiuc.edu/~stvilia/papers/quantWiki.pdf>.

Surowiecki, J. 2005, The Wisdom of Crowds, Anchor Books, New York.

Tapscott, D. & Williams, A.D. 2008, Wikinomics: How Mass Collaboration Changes Every-
thing, Penguin Group, New York.

Van Alstyne, M. & Brynjolfsson, E. 2005, 'Global Village or Cyber-Balkans? Modeling and
Measuring the Integration of Electronic Communities', Management Science, vol. 51,
no. 6, pp. 851-868.

Viégas, F.B., Wattenberg, M. & Dave, K. 2004, 'Studying Cooperation and Conflict between
Authors with history flow Visualizations', SIGCHI conference on Human factors in
computing systems, vol. 6, ACM, Vienna, Austria, pp. 575-582,
<http://alumni.media.mit.edu/~fviegas/papers/history_flow.pdf>.

Viégas, F.B., Wattenberg, M., Kriss, J. & Ham, F.v. 2007, 'Talk Before You Type:
Coordination in Wikipedia', 40th Hawaii International Conference on System Sciences,
Honolulu, Hawaii, USA,
<http://www.research.ibm.com/visual/papers/wikipedia_coordination_final.pdf>.

von Hippel, E. 2001, 'Innovation by User Communities: Learning from Open-Source Soft-
ware', MIT Sloan Management Review, vol. 42, no. 4, pp. 82-86.

von Hippel, E. 2005, Democratizing Innovation (Electronic Version), viewed 07.11.2008,
<http://web.mit.edu/evhippel/www/books/DI/DemocInn.pdf>.

von Krogh, G., Spaeth, S. & Lakhani, K.R. 2003, 'Community, joining, and specialization in
open source software innovation: a case study', Research Policy, vol. 32, pp. 1217-
1241.

Voss, J. 2005, 'Measuring Wikipedia', paper presented to the International Conference of the
International Society for Scientometrics and Informetrics: 10th, Stockholm (Sweden),
24-28 July 2005, <http://eprints.rclis.org/3610/1/MeasuringWikipedia2005.pdf>.

Wang, Y. & Fesenmaier, D.R. 2004, 'Towards understanding members’ general participation
in and active contribution to an online travel community', Tourism Management, vol.
25, pp. 709–722.

Wilkinson, D.M. & Huberman, B.A. 2007, 'Assessing the value of cooperation in Wikipedia',
First Monday, vol. 12, no. 4.

Williams, R.L. & Cothrel, J. 2000, 'Four Smart Ways to Run Online Communities', Sloan
Management Review, vol. 41, no. 4, pp. 81-91.

Internet Sources:
Alexa.com 2008, Traffic Details - wikipedia.org, viewed 10.11.2008
<http://www.alexa.com/data/details/traffic_details/wikipedia.org>.

Döring, N. 2001, Virtuelle Gemeinschaften als Lerngemeinschaften!?, viewed 07.11.2008
<http://www.die-frankfurt.de/zeitschrift/32001/positionen4.htm>.

Encyclopædia Britannica, I. 2006, Fatally Flawed - Refuting the recent study on encyclopedic
accuracy by the journal Nature, viewed 16.04.2009
<http://corporate.britannica.com/britannica_nature_response.pdf>.

Gude, A. 2008, wikipedia-article-exporter, viewed 16.10.2008
<http://code.google.com/p/wikipedia-article-exporter/>.

MediaWiki.org 2009, MWDumper, viewed 16.04.2009
<http://www.mediawiki.org/w/index.php?title=MWDumper&oldid=242629>.

O'Reilly, T. 2005, What Is Web 2.0, viewed 16.03.2009
<http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html>.

stats.grok.se 2009, Wikipedia article traffic statistics, viewed 28.03.2009
<http://stats.grok.se/>.

stilversprechend.de 2009a, stilversprechend, viewed 17.04.2009
<http://www.stilversprechend.de/stil/index.html>.

stilversprechend.de 2009b, Was ist der Flesch-Wert, viewed 17.04.2009
<http://www.stilversprechend.de/stil/fleschwert.html>.

Swartz, A. 2006, Raw Thought: Who Writes Wikipedia?, viewed 27.03.2009
<http://www.aaronsw.com/weblog/whowriteswikipedia>.

Wales, J. 2005a, The Intelligence of Wikipedia, Oxford Internet Institute, viewed 27.03.2009
<http://webcast.oii.ox.ac.uk/?ID=20050711_76&view=Webcast>.

Wales, J. 2005b, Wikipedia is an encyclopedia, viewed 12.03.2009
<http://lists.wikimedia.org/pipermail/wikipedia-l/2005-March/020469.html>.

Wales, J. 2005c, Wikipedia, Emergence, and The Wisdom of Crowds, viewed 27.03.2009
<http://lists.wikimedia.org/pipermail/wikipedia-l/2005-May/021764.html>.

Wikimedia 2008a, dewiki dump progress on 20080607, viewed 03.08.2008
<http://download.wikimedia.org/dewiki/20080607/>.

Wikimedia 2008b, enwiki dump progress on 20080312, viewed 28.03.2009
<http://download.wikimedia.org/enwiki/20080312/>.

Wikimedia 2008c, enwiki dump progress on 20080524, viewed 28.03.2009
<http://download.wikimedia.org/enwiki/20080524/>.

Wikimedia 2008d, List of Wikipedias, viewed 10.11.2008
<http://meta.wikimedia.org/w/index.php?title=List_of_Wikipedias&oldid=1267871>.

Wikimedia 2009, Wikimedia Downloads, viewed 27.03.2009
<http://download.wikimedia.org/>.

Wikipedia 2009a, Archiv/WP 1.0, viewed 29.04.2009
<http://de.wikipedia.org/w/index.php?title=Wikipedia:Archiv/WP_1.0&oldid=30312694>.

Wikipedia 2009b, Benutzerverzeichnis, viewed 16.10.2008
<http://de.wikipedia.org/w/index.php?title=Spezial%3ABenutzer&group=bot>.

Wikipedia 2009c, Bots, viewed 16.04.2009
<http://en.wikipedia.org/w/index.php?title=Wikipedia:Bots&oldid=283931567>.

Wikipedia 2009d, Categorization, viewed 12.04.2009
<http://en.wikipedia.org/w/index.php?title=Wikipedia:Categorization&oldid=282819565#Categorizing_pages>.

Wikipedia 2009e, Featured articles, viewed 12.04.2009
<http://en.wikipedia.org/w/index.php?title=Wikipedia:Featured_articles&oldid=283429397>.

Wikipedia 2009f, Hilfe:Kategorien, viewed 17.04.2009
<http://de.wikipedia.org/w/index.php?title=Hilfe:Kategorien&oldid=58900978>.

Wikipedia 2009g, Mediawiki API documentation page, viewed 28.03.2009
<http://de.wikipedia.org/w/api.php>.

Wikipedia 2009h, Revision history of Virtual community, viewed 23.04.2009
<http://en.wikipedia.org/w/index.php?title=Virtual_community&action=history>.

Wikipedia 2009i, Size of Wikipedia, viewed 20.05.2009
<http://en.wikipedia.org/w/index.php?title=Wikipedia:Size_of_Wikipedia&oldid=291163424>.

Wikipedia 2009j, Statistics, viewed 28.04.2009
<http://en.wikipedia.org/w/index.php?title=Wikipedia:Statistics&oldid=282205700>.

Wikipedia 2009k, Version 1.0 Editorial Team/Assessment, viewed 29.04.2009
<http://en.wikipedia.org/wiki/Wikipedia:Version_1.0_Editorial_Team/Assessment>.

Wikipedia 2009l, Welcome to Wikipedia, viewed 14.04.2009
<http://en.wikipedia.org/w/index.php?title=Main_Page&oldid=273421236>.

Wikipedia 2009m, Who is responsible for these pages, viewed 14.04.2009
<http://en.wikipedia.org/w/index.php?title=Wikipedia:Editorial_oversight_and_control&oldid=283640875#User_collaborative_knowledge-building>.

Wikipedia 2009n, Wikimedia projects, viewed 27.03.2009
<http://en.wikipedia.org/w/index.php?title=Wikimedia_Foundation&oldid=280045683#Wikimedia_projects>.

Wikipedia 2009o, Wikipedia:Barnstars, viewed 28.03.2009
<http://en.wikipedia.org/w/index.php?title=Wikipedia:Barnstars&oldid=279034552>.

Wikipedia 2009p, Wikipedia:Namespace, viewed 16.04.2009
<http://en.wikipedia.org/w/index.php?title=Wikipedia:Namespace&oldid=275699788>.

Wikipedia 2009q, Wikipedia:Talk page, viewed 27.03.2009
<http://en.wikipedia.org/w/index.php?title=Wikipedia:Talk_page&oldid=277750073>.

Wikipedia 2009r, Wikipedia:User access levels, viewed 27.03.2009
<http://en.wikipedia.org/w/index.php?title=Wikipedia:User_access_levels&oldid=279622212>.

Wikipedia 2009s, Wikipedia:User page, viewed 27.03.2009
<http://en.wikipedia.org/w/index.php?title=Wikipedia:User_page&oldid=279760325>.

Wikipedia 2009t, WikiProject, viewed 28.04.2009
<http://en.wikipedia.org/w/index.php?title=Wikipedia:WikiProject&oldid=286400532>.

Wikipedia 2009u, WikiProject Vandalism studies, viewed 20.05.2009
<http://en.wikipedia.org/w/index.php?title=Wikipedia_talk:WikiProject_Vandalism_studies&oldid=291203033>.

Wikipedia 2009v, Wikitext, viewed 16.04.2009
<http://en.wikipedia.org/w/index.php?title=Wikitext&oldid=283256384>.

Wikipedia 2009w, Zusammenfassung und Quellen, viewed 16.04.2009
<http://de.wikipedia.org/w/index.php?title=Hilfe:Zusammenfassung_und_Quellen&oldid=58724557#Auto-Zusammenfassung>.

7 Appendix

7.1 Figures and Examples

XML representation of a Wikipedia article [1]

Online representation of a Wikipedia article [2]

HTML representation of a Wikipedia article [2]

7.2 Data Collection

7.2.1 Setting up the Database and Drawing the Sample

Instead of installing the complete MediaWiki software, the required database schema was set
up using the tables.sql file, which can be found in the MediaWiki repository [3].

The XML dump of the German Wikipedia from 7 June 2008 was downloaded [4] and converted
into an SQL file using the MWDumper tool [5]:

java -jar mwdumper-2008-04-13.jar --output=file:dump.sql --format=sql:1.5 dewiki-20080607-stub-meta-history.xml.gz

The resulting file was then imported into a MySQL database:

mysql -u username -ppassword --database=dbname --force --default-character-set=utf8 < dump.sql

Two tables in the dump seemed especially important for this study: The page-table including
all pages in Wikipedia and the revision-table including every single revision of each page. As
these tables were out of sync (the revision table included more distinct pages than the page
table), it was decided to use the revision table as a basis for the analysis.

As this study aimed to analyze articles created in 2007, all revisions in 2007 were extracted
from the revision table in a first step and indices were added to speed up queries from this
table.

create table revision2007 as select * from revision where extract(year from rev_timestamp)=2007;
alter table revision2007 add index (rev_page);
alter table revision2007 add index (rev_user_text);
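
The year filter above can also be illustrated outside the database. The following is a minimal Python 3 sketch (the scripts in this appendix were written for Python 2) that assumes revision timestamps are stored as 14-digit strings of the form YYYYMMDDHHMMSS, as in the MediaWiki revision table. The sample rows and the helper name revision_year are hypothetical and serve illustration only.

```python
from datetime import datetime


def revision_year(rev_timestamp):
    """Return the year of a 14-digit MediaWiki timestamp (YYYYMMDDHHMMSS)."""
    return datetime.strptime(rev_timestamp, "%Y%m%d%H%M%S").year


# Hypothetical sample rows: (rev_page, rev_user_text, rev_timestamp)
revisions = [
    (1, "Alice", "20061231235959"),
    (1, "Bob", "20070101000000"),
    (2, "Alice", "20071231235959"),
    (2, "Carol", "20080101000000"),
]

# Keep only revisions made in 2007, mirroring the revision2007 table.
revision2007 = [row for row in revisions if revision_year(row[2]) == 2007]
```

Only the two revisions with timestamps inside 2007 remain in the list; the boundary cases on 31 December 2006 and 1 January 2008 are excluded.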

In a next step a table was created to depict which pages were edited by which user in 2007
and how often. Again, several indices were created to improve query performance:

create table userpages2007 as select rev_page,rev_user_text,count(*) from revision2007 group by rev_page,rev_user_text order by null;
alter table userpages2007 add index (rev_user_text);
alter table userpages2007 add index (rev_page);
alter table userpages2007 add index (rev_page,rev_user_text);
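
The grouping performed by this query corresponds to counting how often each (page, user) pair occurs among the revisions. A minimal Python 3 sketch of the same idea, using hypothetical sample data, could look as follows:

```python
from collections import Counter

# Hypothetical sample of 2007 revisions as (rev_page, rev_user_text) pairs.
revision2007 = [
    (1, "Alice"),
    (1, "Bob"),
    (1, "Alice"),
    (2, "Carol"),
]

# One entry per (page, user) pair with the number of edits,
# analogous to "group by rev_page, rev_user_text" with count(*).
userpages2007 = Counter(revision2007)
```

Each key of the counter plays the role of one row of the userpages2007 table, and its value corresponds to the count(*) column.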

To flag bots a column was added to this table…:

alter table userpages2007 add column bot boolean;

…an up-to-date user-group assignment list was retrieved from [6], and the column was updated
with the help of the following short Python script:

import urllib
import re

class AppURLopener(urllib.FancyURLopener): #set user agent to firefox
    version= "Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11"

#get bot list
AppURLopenerinstance=AppURLopener()
#adapt the URL to the respective Wikipedia language version
botlist=AppURLopenerinstance.open("http://de.wikipedia.org/w/index.php?title=Spezial%3ABenutzer&username=&group=bot&limit=5000") #make sure that all bots are included in this query
global_botlistcontent=botlist.read()
botlist.close()

#extract the user names that are followed by a link to the bot group page
bots=re.findall(""">([^<]*)</a> \xe2\x80\x8e\(<a href="/wiki/Wikipedia:Bots""",global_botlistcontent)

import MySQLdb
conn = MySQLdb.connect (host = "127.0.0.1",
                        user = "username",
                        passwd = "password",
                        db = "dbname")
cursor = conn.cursor (MySQLdb.cursors.DictCursor)

for bot in bots:
    print bot
    cursor.execute ("update userpages2007 set bot=1 where rev_user_text=%s",(bot,))
    result = cursor.fetchall()
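
The extraction step of this script can be tried out without network access. The following minimal Python 3 sketch applies the same pattern to a hypothetical HTML fragment; the fragment, the account names ExampleBot and ExampleAdmin and the administrator link are made up for illustration. The byte sequence \xe2\x80\x8e used above is the UTF-8 encoding of the left-to-right mark U+200E. The call to html.unescape reflects an assumption that is not stated in this thesis, namely that names containing characters such as "&" appear entity-encoded in the page source.

```python
import html
import re

# Same pattern as above, written for Python 3 text strings.
BOT_PATTERN = re.compile('>([^<]*)</a> \u200e\\(<a href="/wiki/Wikipedia:Bots')

# Hypothetical excerpt of the user list page (illustration only).
sample_html = (
    '<li><a href="/wiki/Benutzer:ExampleBot" title="Benutzer:ExampleBot">ExampleBot</a> '
    '\u200e(<a href="/wiki/Wikipedia:Bots" title="Wikipedia:Bots">Bot</a>)</li>\n'
    '<li><a href="/wiki/Benutzer:L%26K-Bot" title="Benutzer:L&amp;K-Bot">L&amp;K-Bot</a> '
    '\u200e(<a href="/wiki/Wikipedia:Bots" title="Wikipedia:Bots">Bot</a>)</li>\n'
    '<li><a href="/wiki/Benutzer:ExampleAdmin" title="Benutzer:ExampleAdmin">ExampleAdmin</a> '
    '\u200e(<a href="/wiki/Wikipedia:Administratoren" title="Wikipedia:Administratoren">Administrator</a>)</li>\n'
)

raw_names = BOT_PATTERN.findall(sample_html)         # names as they appear in the HTML
bot_names = [html.unescape(n) for n in raw_names]    # entity-decoded names
```

In this sketch the administrator account is not matched, and the entity-encoded name is decoded before it would be compared with rev_user_text.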

Because its name contains special characters, one bot had to be flagged by hand. In addition,
all remaining bot values were set to 0 and a ‘bots’ table including all bots was created:

update userpages2007 set bot=1 where rev_user_text="L&K-Bot";
update userpages2007 set bot=0 where bot IS NULL;
create table bots as select distinct(rev_user_text) from userpages2007 where bot=1;

After that, columns for the page title and page namespace were added to the table…

alter table userpages2007 add column page_title varchar(255);
alter table userpages2007 add column page_namespace int(11);

…and filled with values from the page-table:

update userpages2007,page set userpages2007.page_title=page.page_title where userpages2007.rev_page=page.page_id;
update userpages2007,page set userpages2007.page_namespace=page.page_namespace where userpages2007.rev_page=page.page_id;

Missing values were queried from the Wikipedia API and missing pages were flagged with the
following Python script:

import urllib
import re
import xml.etree.cElementTree as cElementTree
import time

class AppURLopener(urllib.FancyURLopener): #set user agent to firefox
    version= "Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11"

import MySQLdb
conn = MySQLdb.connect (host = "127.0.0.1",
                        user = "username",
                        passwd = "password",
                        db = "dbname")
cursor = conn.cursor (MySQLdb.cursors.DictCursor)

AppURLopenerinstance=AppURLopener()
numberofmissingvalues=100
for i in range (0,numberofmissingvalues/50): #it is only possible to query 50 values from the api without bot status at once
    cursor.execute ("select distinct(rev_page) from userpages2007 where page_title is NULL limit 50") #create batches of 50
    resultpage = cursor.fetchall()
    print resultpage
    print
    stringliste=[] #create query string for api
    for page in resultpage:
        stringliste.append(str(page['rev_page']))
    fertigerstring="|".join(stringliste)
    print fertigerstring

    ##get api content
    liste=AppURLopenerinstance.open("http://de.wikipedia.org/w/api.php?action=query&pageids="+fertigerstring+"&format=xml") #get api results

    for event, elem in cElementTree.iterparse(liste):
        if elem.tag=='page':
            if elem.attrib.has_key('missing'):
                print "missing"
                cursor.execute ("update userpages2007 set page_title=%s, page_namespace=%s where rev_page=%s",("!missing","999",elem.attrib['pageid'])) #flag missing pages
            else:
                print "---"
                print elem.attrib['pageid']
                print elem.attrib['ns']
                print elem.attrib['title']
                print
                cursor.execute ("update userpages2007 set page_title=%s, page_namespace=%s where rev_page=%s",(elem.attrib['title'].encode("latin-1"),elem.attrib['ns'],elem.attrib['pageid'])) #changed from utf-8
                cursor.execute ("select * from userpages2007 where rev_page=%s limit 1",(elem.attrib['pageid'],))
                resultpage = cursor.fetchall()
                print resultpage

    time.sleep(5)# sleep 5 seconds to reduce load on server
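
The batching logic of this script, which respects the limit of 50 page ids per API request mentioned in the comment above, can be sketched independently of the database. The following Python 3 example is a minimal illustration; the helper names chunks and build_query_url are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

API_URL = "http://de.wikipedia.org/w/api.php"


def chunks(items, size=50):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_query_url(page_ids):
    """Build the API URL for one batch of page ids (joined with '|')."""
    params = {
        "action": "query",
        "pageids": "|".join(str(page_id) for page_id in page_ids),
        "format": "xml",
    }
    return API_URL + "?" + urlencode(params)


# Hypothetical list of 120 page ids that still lack a title.
missing_ids = list(range(1, 121))
batches = list(chunks(missing_ids))
urls = [build_query_url(batch) for batch in batches]
```

In contrast to a fixed number of loop iterations, slicing the list also covers a final batch with fewer than 50 ids.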

Then a table was created containing all articles in the main namespace (namespace 0; see [7]
for more details) that were edited in 2007 and not exclusively by bots:

create table pagelist2007 as select distinct(rev_page),page_title,page_namespace from userpages2007 where page_namespace=0 and bot=0;
alter table pagelist2007 add index (rev_page);

It turned out that an easier way of removing all pages not in namespace 0 would probably
have been to use the --filter option of the MWDumper tool (see [5] for more details).

Next, a column depicting the date of creation of each article was created…

alter table pagelist2007 add column creationdate binary(14);

… and populated with the help of the following Python script:

import MySQLdb
conn = MySQLdb.connect (host = "127.0.0.1",
                        user = "username",
                        passwd = "password",
                        db = "dbname")
cursor = conn.cursor (MySQLdb.cursors.DictCursor)

cursor.execute ("select * from pagelist2007 where creationdate is Null")
pages = cursor.fetchall()
for page in pages:
    cursor.execute ("select min(rev_timestamp) from revision where rev_page=%s;",(page["rev_page"],)) #get date of first edit=creation
    result = cursor.fetchone()
    cursor.execute ("UPDATE pagelist2007 SET creationdate = %s where rev_page=%s",(result["min(rev_timestamp)"],page["rev_page"],))
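
The idea behind this script is that the creation date of an article equals the timestamp of its earliest revision. A minimal Python 3 sketch of this step, using hypothetical sample data and the hypothetical helper creation_dates, is given below. It relies on the assumption that timestamps are 14-digit strings (YYYYMMDDHHMMSS), so that comparing them as strings yields chronological order.

```python
def creation_dates(revisions):
    """Map each page id to the smallest (earliest) timestamp among its revisions."""
    first = {}
    for rev_page, rev_timestamp in revisions:
        if rev_page not in first or rev_timestamp < first[rev_page]:
            first[rev_page] = rev_timestamp
    return first


# Hypothetical sample rows: (rev_page, rev_timestamp)
revisions = [
    (1, "20070305101500"),
    (1, "20070214080000"),
    (2, "20061120093000"),
    (1, "20071201120000"),
]

dates = creation_dates(revisions)
```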

Subsequently a column for the number of contributors in 2007 was added and…

alter table pagelist2007 add column user2007 int(11);

…the number of users in 2007 without bots calculated:

import MySQLdb
conn = MySQLdb.connect (host = "127.0.0.1",
                        user = "username",
                        passwd = "password",
                        db = "dbname")
cursor = conn.cursor (MySQLdb.cursors.DictCursor)

cursor.execute ("select rev_page from pagelist2007 where user2007 is Null")
pages = cursor.fetchall()

for page in pages:
    cursor.execute ("select count(distinct rev_user_text) as contributors from revision2007 where rev_page=%s and rev_user_text not in(select rev_user_text from bots)",(page["rev_page"],)) #select users 2007 without bots
    result = cursor.fetchone()
    cursor.execute ("UPDATE pagelist2007 SET user2007 = %s where rev_page=%s",(result["contributors"],page["rev_page"],))

A random sample of 5000 articles that were created in 2007 and had more than one contributor
was drawn from this table and stored in the samplepagescreated2007 table.

create table samplepagescreated2007 as select * from pagelist2007 where extract(year from creationdate)=2007
and user2007>1 order by rand() limit 5000;
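
The two preceding steps, counting the non-bot contributors of each page and drawing a random sample of pages with more than one contributor, can be reproduced on a small scale. The following Python 3 sketch is an illustration under stated assumptions: it uses an in-memory SQLite database as a stand-in for MySQL, hypothetical sample data, and a seeded random generator (the seed value is arbitrary) so that the draw is repeatable. The filter on the creation year is omitted for brevity.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table revision2007 (rev_page integer, rev_user_text text);
create table bots (rev_user_text text);
insert into bots values ('ExampleBot');
""")

# Hypothetical data: page n has n distinct human contributors plus one bot edit.
rows = []
for page in range(1, 11):
    rows.append((page, "ExampleBot"))
    for user in range(page):
        rows.append((page, "User%d" % user))
conn.executemany("insert into revision2007 values (?, ?)", rows)

# Contributors per page without bots, computed in one grouped query.
contributors = dict(conn.execute("""
    select rev_page, count(distinct rev_user_text)
    from revision2007
    where rev_user_text not in (select rev_user_text from bots)
    group by rev_page
""").fetchall())

# Draw a repeatable random sample of pages with more than one contributor.
eligible = sorted(page for page, n in contributors.items() if n > 1)
sample = random.Random(2007).sample(eligible, 5)
```

A grouped query of this kind returns the contributor counts for all pages at once, whereas the script above issues one query per page.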

Finally columns for each variable and a column indicating whether the article was already
analyzed were added to the table:

alter table samplepagescreated2007 add column words int(10) unsigned;# for nr. of words
alter table samplepagescreated2007 add column wordpool int(10) unsigned;# for vocabulary
alter table samplepagescreated2007 add column uniquesumlinks int(10) unsigned;#for unique total links

alter table samplepagescreated2007 add column fleschd int(10) unsigned; #for flesch readability score
alter table samplepagescreated2007 add column categoriescalc int(10) unsigned;# for nr. of categories

alter table samplepagescreated2007 add column userscalc int(10) unsigned; #for community-size and interactivity
alter table samplepagescreated2007 add column heterogeneitycalc double unsigned; #for heterogeneity

alter table samplepagescreated2007 add column avgactivity double unsigned; #for activity
alter table samplepagescreated2007 add column avgfocus double unsigned; #for focus

alter table samplepagescreated2007 add column interactionscalc int(10) unsigned; #for calculating interactivity
alter table samplepagescreated2007 add column editscalc int(10) unsigned; #for calculating interactivity
alter table samplepagescreated2007 add column mediantimebetweenedits double unsigned; #for calculating dynamics

alter table samplepagescreated2007 add column analysed boolean; #set to 1 if article was already analyzed

7.2.2 Parsing the Data and Calculating all Variables

The following sections explain the process of calculating all variables from the following
three sources:

• XML file of each article
• HTML file of each article
• The MySQL database created

7.2.2.1 Retrieval and Analyses of the XML Representation of each Article

The following script queries the title of each article from the Wikipedia API and downloads
the article’s XML file from Wikipedia with the help of an adapted version of the getwiki
script by Gude [9] (all links to the English Wikipedia were replaced by the respective links to
the German Wikipedia). The file is stored to a folder and parsed. In a first step vandals are
flagged. After that the XML file is parsed again to calculate the number of users (excluding
bots and vandals), categories, interactions and edits. In case of any problems, the article is
flagged with a problem code (1: article is a redirect, 2: rev_page id of downloaded article file
does not match rev_page in database, 3: redirect & ids do not match, 4: article not found via
API) and excluded from further analysis.

# -*- coding: cp1252 -*-


#### utf-8
import xml.etree.cElementTree as ElementTree
import datetime
import time
import urllib
import re
import numpy
import degetwiki #importing getwiki by Alexander Gude Version 1.1, adapted to the German Wikipedia
import os
import pylab
import pickle
import math

import MySQLdb
conn = MySQLdb.connect (host = "127.0.0.1",
user = "username",
passwd = "password",
db = "dbname")
cursor = conn.cursor (MySQLdb.cursors.DictCursor)

class AppURLopener(urllib.FancyURLopener):
    version = "Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11"

AppURLopenerinstance=AppURLopener()

def get_botlist():
    if os.path.exists("botlist.txt"):
        #print "botfile found"
        botfile = file('botlist.txt', 'r')
        botlist=pickle.load(botfile) #read from file
    else:
        botlist=[] #pickled and unpickled
        conn = MySQLdb.connect (host = "127.0.0.1",
                                user = "username",
                                passwd = "password",
                                db = "dbname")
        cursor = conn.cursor (MySQLdb.cursors.DictCursor)
        cursor.execute ("select * from bots;") #get bots from database
        bots = cursor.fetchall()
        for bot in bots:
            botlist.append(bot['rev_user_text'])
        #save to file
        botfile = file('botlist.txt', 'w')
        pickle.dump(botlist,botfile) #store botfile

    botfile.close() #close botfile in both cases

    return botlist

def analysearticle(doc,rev_page):
    conn = MySQLdb.connect (host = "127.0.0.1",
                            user = "username",
                            passwd = "password",
                            db = "dbname")
    cursor = conn.cursor (MySQLdb.cursors.DictCursor)
    redirect=False #is article a redirect?
    problem=0 #is there a problem?
    filecelement=open(doc, "r") #open xml file
    revcounter=0 #revision counter

    newerthan2007=False
    articleidfound=False # there are several id fields (article,revision,user)
    vandalism=set()
    vandalismuser=set()
    upperlimit=datetime.datetime(2008, 1, 1)
    wrongarticle=False

    #get botlist
    bots=get_botlist()
    #print bots

    for event, elem in ElementTree.iterparse(filecelement): #first run, flag vandalism

        if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}username" or elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}ip":
            currentusernameorip=elem.text

        if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}timestamp":
            currentzeit=elem.text
            if datetime.datetime.strptime(elem.text, "%Y-%m-%dT%H:%M:%SZ")>=upperlimit:
                newerthan2007=True

        if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}title":
            articletitle=elem.text

        if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}id":
            if not articleidfound:
                articleid=elem.text
                if str(articleid)!=rev_page: #ids in database and id in downloaded article don't match
                    wrongarticle=True
                articleidfound=True

        if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}comment":
            if not newerthan2007:
                #potential vandalism detected, see http://de.wikipedia.org/wiki/Hilfe:Zusammenfassung_und_Quelle
                if elem.text=="[[Hilfe:Zusammenfassung und Quelle#Auto-Zusammenfassung|AZ]]: Der Seiteninhalt wurde durch einen anderen Text ersetzt." or elem.text=="[[Hilfe:Zusammenfassung und Quelle#Auto-Zusammenfassung|AZ]]: Die Seite wurde geleert.":
                    vandalism.add(revcounter)
                    vandalismuser.add("'"+currentusernameorip+"'")

        if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}text":
            if not newerthan2007:
                if revcounter not in vandalism:
                    if elem.text:
                        if "#redirect[[" in elem.text.lower() or "#redirect [[" in elem.text.lower() or "#weiterleitung[[" in elem.text.lower() or "#weiterleitung [[" in elem.text.lower(): #redirect?
                            redirect=True #last revision includes redirect
                        else:
                            redirect=False
                    else: #if everything is deleted there's no text in the text element
                        vandalism.add(revcounter)
                        vandalismuser.add("'"+currentusernameorip+"'")
            revcounter+=1

    if wrongarticle==False and redirect==False: #ids fit,no redirect
        newerthan2007=False
        revcounter=0
        userset=set()
        filecelement =open(doc, "r")
        olduser=""
        interactions=0
        edits=0

        for event, elem in ElementTree.iterparse(filecelement): #calculate vectors

            if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}username" or elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}ip":
                currentusernameorip=elem.text #store username or ip

            if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}timestamp":
                currentzeit=elem.text
                if datetime.datetime.strptime(elem.text, "%Y-%m-%dT%H:%M:%SZ")>=upperlimit:
                    newerthan2007=True #flag revisions after 2007

            if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}text":
                if not newerthan2007:
                    if revcounter not in vandalism and currentusernameorip not in bots:
                        edits+=1
                        userset.add(currentusernameorip)
                        categories=len(re.findall(r"\[\[Kategorie:(.*)]]",elem.text))
                        if not currentusernameorip == olduser: #editor changed since the last counted edit
                            interactions+=1
                        olduser=currentusernameorip
                revcounter+=1
        userscalc=len(userset)
        cursor.execute ("UPDATE samplepagescreated2007 SET interactionscalc=%s,userscalc=%s,categoriescalc=%s, editscalc=%s WHERE rev_page = %s",(interactions,userscalc,categories,edits,rev_page,))

    else: #ids don't fit! or redirect
        if redirect==True and wrongarticle==True: #redirect and wrongid
            problem=3
        if wrongarticle==True and redirect==False: #wrongid
            problem=2
        if redirect==True and wrongarticle==False: #redirect
            problem=1

    cursor.execute ("UPDATE samplepagescreated2007 SET analysed = 1,problem=%s WHERE rev_page = %s",(problem,rev_page,)) #do for all
    cursor.close ()
    conn.close ()

def main(article,offset,rev_page):
    wikiheader="""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.3/ http://www.mediawiki.org/xml/export-0.3.xsd"
version="0.3" xml:lang="en">
<siteinfo>
<sitename>Wikipedia</sitename>
<base>http://en.wikipedia.org/wiki/Main_Page</base>
<generator>MediaWiki 1.14alpha</generator>
<case>first-letter</case>
<namespaces>
<namespace key="-2">Media</namespace>
<namespace key="-1">Special</namespace>
<namespace key="0" />
<namespace key="1">Talk</namespace>
<namespace key="2">User</namespace>
<namespace key="3">User talk</namespace>
<namespace key="4">Wikipedia</namespace>
<namespace key="5">Wikipedia talk</namespace>
<namespace key="6">Image</namespace>
<namespace key="7">Image talk</namespace>
<namespace key="8">MediaWiki</namespace>
<namespace key="9">MediaWiki talk</namespace>
<namespace key="10">Template</namespace>
<namespace key="11">Template talk</namespace>
<namespace key="12">Help</namespace>
<namespace key="13">Help talk</namespace>
<namespace key="14">Category</namespace>
<namespace key="15">Category talk</namespace>
<namespace key="100">Portal</namespace>
<namespace key="101">Portal talk</namespace>
</namespaces>
</siteinfo>
<page>
"""
    #download article
    if os.path.exists(rev_page+".xml"):
        print "article xml file found, using this version. Delete old version to trigger download"
    else:
        if offset !=1:
            degetwiki.downloadArticles(articlename=article,filename=rev_page+"_temp.xml",verbose=True,split=False,offset=offset)
            #create new file, add header, add content
            filehandler = open(rev_page+'.xml', 'w')
            filehandler.write(wikiheader) #add header to file
            filehandler.write(open(rev_page+"_temp.xml").read())
            filehandler.close()
        else: #download whole article
            degetwiki.downloadArticles(articlename=article,filename=rev_page+".xml",verbose=True,split=False,offset=offset)
        if os.path.exists(rev_page+"_temp.xml"): #the temporary file only exists if an offset was used
            os.remove(rev_page+"_temp.xml") #remove temporary file
    analysearticle(doc=rev_page+".xml",rev_page=rev_page) #analyze article

pagequery=cursor.execute ("select * from samplepagescreated2007 where analysed is Null order by rev_page") #get all pages not analyzed yet
pages = cursor.fetchall()
for page in pages:
    missing=0
    article=""
    if not os.path.exists(str(page['rev_page'])+".xml"): #don't query if article xml exists
        apilist=AppURLopenerinstance.open("http://de.wikipedia.org/w/api.php?action=query&pageids="+str(page['rev_page'])+"&format=xml") #get name from api
        #print apilist
        for event, elem in ElementTree.iterparse(apilist):
            if elem.tag=='page':
                if elem.attrib.has_key('missing'):
                    print "missing"
                    missing=1
                else:
                    print "---"
                    article=elem.attrib['title'].encode("utf-8")
                    print article
    if missing ==0: #if article found
        main(article,rev_page=str(page['rev_page']),offset="2007-01-01T00:00:00Z")
    else: #if article was not found, set problem to 4 and analysed to 1
        cursor.execute ("UPDATE samplepagescreated2007 SET analysed = 1,problem=%s WHERE rev_page = %s",(4,page['rev_page'],)) #do for all
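
The counting logic of the second parsing run can be summarized in a few lines. The following Python 3 sketch is a simplified illustration and not the script used in this study: it takes the revisions of an article as a chronological list of user names, skips bots and revisions flagged as vandalism, and counts edits, distinct users and interactions, where an interaction is counted whenever the editor differs from the editor of the previously counted revision. The function name summarize_revisions and all sample names are hypothetical.

```python
def summarize_revisions(revisions, bots=frozenset(), vandal_revisions=frozenset()):
    """Count edits, distinct users and editor changes for one article.

    revisions: user names (or IPs) in chronological order
    bots: user names to be ignored
    vandal_revisions: indices of revisions flagged as vandalism
    """
    users = set()
    edits = 0
    interactions = 0
    previous_user = ""
    for index, user in enumerate(revisions):
        if index in vandal_revisions or user in bots:
            continue  # ignored revisions do not change previous_user
        edits += 1
        users.add(user)
        if user != previous_user:
            interactions += 1
        previous_user = user
    return {"edits": edits, "users": len(users), "interactions": interactions}


# Hypothetical revision history of one article.
history = ["Alice", "Alice", "Bob", "ExampleBot", "Bob", "Alice", "Vandal"]
result = summarize_revisions(history, bots={"ExampleBot"}, vandal_revisions={6})
```

In this example the bot edit between the two revisions by Bob is ignored, so the two revisions count as consecutive edits by the same user and add only one interaction.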

Due to a number of missing pages and redirects in the sample a second draw was necessary:

create table samplepagescreated2nddraw as select * from resultspagescreated2007 where user2007>1 order by rand() limit 200; #draw additional articles

delete from samplepagescreated2nddraw where rev_page in (Select rev_page from samplepagescreated2007); #delete duplicates

insert into samplepagescreated2007(rev_page,page_title,page_namespace,creationdate,user2007,user,edits2007,edits) select rev_page,page_title,page_namespace,creationdate,user2007,user,edits2007,edits from samplepagescreated2nddraw; #insert into sample table

drop table samplepagescreated2nddraw; #drop the table

7.2.2.2 Retrieval and Analyses of the HTML Representation of each Article

To calculate the number of words and the Flesch readability score, another Python script was
developed. The function getlastid examines the XML representation of an article and returns
the revision number of the last revision in 2007. Subsequently, the HTML version of this
revision is obtained from Wikipedia using specific parameters explained in [8]. The article
content is extracted, some clean-ups are conducted, the HTML markup is stripped and the plain
text is sent to [10] for analysis. The Flesch readability scores are extracted from the results
and inserted into the MySQL database.

# -*- coding: cp1252 -*-

import urllib
import urllib2
import re
import xml.etree.cElementTree as ElementTree
import MySQLdb
import datetime
import time

def strip_tags(value):
    "Return the given HTML with all tags stripped."
    return re.sub(r'<[^>]*?>', '', value)

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'
headers = { 'User-Agent' : user_agent }

def getlastid(doc,rev_page): #find out id of last revision in 2007
    filecelement =open(doc, "r") #assumption doc already exists!
    newerthan2007=False
    articleidfound=False # there are several id fields (article,revision,user)
    upperlimit=datetime.datetime(2008, 1, 1)
    inrevision=False
    revisionid=0

    for event, elem in ElementTree.iterparse(filecelement):

        if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}revision":
            inrevision=True

        if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}timestamp":
            currentzeit=elem.text
            if datetime.datetime.strptime(elem.text, "%Y-%m-%dT%H:%M:%SZ")>=upperlimit: #if newer than 2007
                newerthan2007=True
            else:
                lastrevisionid=revisionid #as the timestamp tag comes after the revision tag, assign here

        if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}id":
            if inrevision==True:
                revisionid=elem.text
                inrevision=False

    print lastrevisionid

    return lastrevisionid

conn = MySQLdb.connect (host = "127.0.0.1",
                        user = "username",
                        passwd = "password",
                        db = "dbname")
cursor = conn.cursor (MySQLdb.cursors.DictCursor) #connect to db

pagequery=cursor.execute ("select rev_page from samplepagescreated2007 where problem=0 and words is Null")
pages = cursor.fetchall()
for page in pages:
    rev_page=str(page['rev_page'])
    lastrevisionid=getlastid(doc=rev_page+".xml",rev_page=rev_page)

    ##
    url ="http://de.wikipedia.org/w/index.php"

    values = {'title' : "", "curid" : rev_page, "oldid" : lastrevisionid}

    data = urllib.urlencode(values)
    print data
    req = urllib2.Request(url, data, headers)
    response = urllib2.urlopen(req)
    the_page = response.read() #get last revision with specific id

    #do some cleanups:
    content=the_page[the_page.index("<!-- start content -->"):the_page.index("<!-- end content -->")] #extract article content
    content=re.sub("""<div class="printfooter">\n.*</div>""","",content)
    content=content.replace("""<script type="text/javascript">\n//<![CDATA[\n if (window.showTocToggle) { var tocShowText = "Anzeigen"; var tocHideText = "Verbergen"; showTocToggle(); } \n//]]>\n</script>\n""","")
    stripped_content=strip_tags(content.decode("utf-8").replace("&#160;"," ")) #strip tags
    stripped_content=re.sub("Eine gesichtete Version dieser Seite.*, basiert auf dieser Version.","",stripped_content)
    stripped_content=stripped_content.replace("[Bearbeiten]","")
    stripped_content=stripped_content.replace("[Verbergen]","")
    stripped_content=stripped_content.replace(u"&#32;"," ")
    stripped_content=stripped_content.replace(u"&amp;","&")

    #connect to stilversprechend.de
    url2 = 'http://www.stilversprechend.de/stil/bericht.html'#index.html'
    values2 = {'text' : stripped_content.encode("cp1252","ignore")}
    data2 = urllib.urlencode(values2)
    req2 = urllib2.Request(url2, data2, headers)
    response2 = urllib2.urlopen(req2)
    the_page2 = response2.read()

    #extract number of sentences, words, syllables, characters
    x2=re.search("""Ihr Text besteht aus <b>(?P<Saetze>\d*)
</b> Sätzen, <b>(?P<Woerter>\d*)</b>
W&ouml;rtern, <b>(?P<Silben>\d*)</b>
Silben und <b>(?P<Zeichen>\d*)</b>
Zeichen.""",the_page2)

    print "Sätze: ",x2.group("Saetze")
    print "Wörter: ",x2.group("Woerter")
    print "Silben: ",x2.group("Silben")
    print "Zeichen: ",x2.group("Zeichen")

    #extract flesch readability score
    x3=re.search("""Der <a href="/stil/fleschwert.html;jsessionid=\S*">Flesch-Wert</a> liegt bei <b>
(?P<FleschWert>\d*)</b>.""",the_page2)
    try:
        print "Flesch-ValueGerman",x3.group("FleschWert") #<---- important! #if there are no sentences, there is no Flesch
        cursor.execute ("UPDATE samplepagescreated2007 SET fleschd =%s WHERE rev_page = %s",(x3.group("FleschWert"),rev_page,)) #update all
    except AttributeError: #no flesch value
        print "no flesch"
    print 10*"-"

    time.sleep(5)#sleep to reduce load on server

    #save plain text to txt file
    savetextfilehandler=open(str(rev_page)+"txt.txt","w")
    savetextfilehandler.write(stripped_content.encode("cp1252","ignore"))
    savetextfilehandler.close()
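
In this study the Flesch value was taken from the external service and not computed locally. For illustration, the following Python 3 sketch shows how a German Flesch score could be derived from the sentence, word and syllable counts that the script extracts. It uses Amstad's adaptation of the Flesch formula for German, 180 - ASL - 58.5 * ASW, where ASL is the average sentence length in words and ASW the average number of syllables per word. Whether the service applies exactly this variant is an assumption, and the function name flesch_german is hypothetical.

```python
def flesch_german(sentences, words, syllables):
    """Flesch reading ease for German text (Amstad's adaptation).

    Returns None if the text contains no sentences or no words,
    since the score is undefined in that case.
    """
    if sentences == 0 or words == 0:
        return None
    average_sentence_length = words / sentences     # ASL
    average_syllables_per_word = syllables / words  # ASW
    return 180.0 - average_sentence_length - 58.5 * average_syllables_per_word


# Hypothetical counts: 2 sentences, 20 words, 35 syllables.
score = flesch_german(sentences=2, words=20, syllables=35)
```

Returning None for texts without sentences mirrors the case handled by the except branch above, in which no Flesch value is available.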

The following script was used to calculate the number of words and the number of unique words
(vocabulary) in each article. The plain text version of each article stored before is loaded and
split into words. The number of words and unique words is counted and stored in the database:

import os.path
import re
import MySQLdb

FOLDER = "..." #path to plain text files saved before

i=0
conn = MySQLdb.connect (host = "127.0.0.1",
                        user = "username",
                        passwd = "password",
                        db = "dbname")
cursor = conn.cursor (MySQLdb.cursors.DictCursor)

for filename in os.listdir (FOLDER): #iterate over files in folder
    if "txt.txt" in filename: #plain text representations of articles were saved as *txt.txt files
        y=open(os.path.join(FOLDER,filename),"r").read()
        #split text into tokens, convert to lower case
        #note: \w+\S+ needs at least two characters, so one-letter words are not matched
        temptextsplit=re.findall('\w+\S+|[^\w\s]+',y.lower())
        tempvektordict={} #create dictionary with word frequencies
        tempvektordicthandler=tempvektordict.get
        for item in temptextsplit:
            tempvektordict[item] = tempvektordicthandler(item, 0) + 1
        print len(temptextsplit) #words without spaces
        print len(tempvektordict) #unique words
        cursor.execute ("UPDATE samplepagescreated2007 SET words=%s, wordpool =%s WHERE rev_page = %s",(len(temptextsplit),len(tempvektordict),filename.replace("txt.txt",""),)) #update in db
        print 5*"-"
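
The counting step can be illustrated independently of the database. The following is a minimal sketch in Python 3 (the scripts in this appendix were written for Python 2). It reuses the token pattern of the script above; the function name count_words and the sample sentence in the usage note are illustrative assumptions and not part of the original scripts.

```python
import re
from collections import Counter

# Same token pattern as in the script above: a run of word characters followed
# by at least one more non-space character, or a run of punctuation characters.
TOKEN_PATTERN = re.compile(r"\w+\S+|[^\w\s]+")


def count_words(text):
    """Return (number of tokens, number of unique tokens) for a text."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    frequencies = Counter(tokens)  # word -> number of occurrences
    return len(tokens), len(frequencies)


if __name__ == "__main__":
    print(count_words("Der Hund jagt die Katze. Die Katze jagt den Hund nicht."))
```

With this pattern, punctuation that directly follows a word stays attached to it, so "katze." and "katze" count as two different vocabulary entries.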


The total number of unique links in each article was calculated by the following script. Again,
the last revision of each article is determined from the XML file and the respective HTML file
is obtained from Wikipedia. The content of the article is extracted, some clean-ups conducted
and all unique links are counted (page internal links are omitted):

# -*- coding: cp1252 -*-

import urllib
import urllib2
import re
import xml.etree.cElementTree as ElementTree
import MySQLdb
import datetime
import time

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

headers = { 'User-Agent' : user_agent }

def getlastid(doc,rev_page): #find out id of last revision in 2007
    filecelement =open(doc, "r") #assumption: doc already exists!
    newerthan2007=False
    articleidfound=False #there are several id fields (article, revision, user)
    upperlimit=datetime.datetime(2008, 1, 1)
    inrevision=False
    revisionid=0

    for event, elem in ElementTree.iterparse(filecelement):

        if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}revision":
            inrevision=True

        if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}timestamp":
            currentzeit=elem.text
            if datetime.datetime.strptime(elem.text, "%Y-%m-%dT%H:%M:%SZ")>=upperlimit: #if newer than 2007
                newerthan2007=True
            else:
                lastrevisionid=revisionid #as the timestamp tag comes after the revision tag, assign here

        if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}id":
            if inrevision==True:
                revisionid=elem.text
                inrevision=False

    print lastrevisionid

    return lastrevisionid

conn = MySQLdb.connect (host = "127.0.0.1",
                        user = "username",
                        passwd = "password",
                        db = "dbname")
cursor = conn.cursor (MySQLdb.cursors.DictCursor) #connect to db

pagequery=cursor.execute ("select rev_page from samplepagescreated2007 where problem=0 and uniquesumlinks is NULL")
pages = cursor.fetchall()
i=0
problem=0
for page in pages:
    if problem ==0: #stop if there is a problem
        i+=1
        print i," artikel"

        print page['rev_page']
        rev_page=str(page['rev_page'])
        lastrevisionid=getlastid(doc=rev_page+".xml",rev_page=rev_page)

        url ="http://de.wikipedia.org/w/index.php"
        values = {'title' : "", "curid" : rev_page, "oldid" : lastrevisionid}
        data = urllib.urlencode(values)
        print "http://de.wikipedia.org/wiki/index.php?"+data+'""'
        req = urllib2.Request(url, data, headers)
        response = urllib2.urlopen(req)
        the_page = response.read()

        #extract content
        content=the_page[the_page.index("<!-- start content -->"):the_page.index("<!-- end content -->")]

        #do some clean-ups
        content=re.sub("""<div class="printfooter">\n.*</div>""","",content)
        content=content.replace("""<script type="text/javascript">\n//<![CDATA[\n if (window.showTocToggle) { var tocShowText = "Anzeigen"; var tocHideText = "Verbergen"; showTocToggle(); } \n//]]>\n</script>\n""","") #delete toggle link
        content=re.sub("Eine <.*?>gesichtete Version</a> dieser Seite, <.*?>freigegeben</a> am <i>.*?</i>, basiert auf dieser Version.","",content)
        content=re.sub("<span class=\"editsection\">\[<a href=.*?Bearbeiten</a>]</span>","",content) #delete edit links
        content=re.sub(re.compile("<table class=\"metadata\".*?</table>",re.DOTALL),"",content) #delete metadata links not visible
        content=re.sub(re.compile("<span class=\"metadata\".*?</span>",re.DOTALL),"",content) #delete metadata links not visible

        alllinks=re.findall("<a href=\"([^\"]*)\"",content)
        extandintlinks=re.findall("<a href=\"([^#][^\"]*)\"",content)
        pageintlinks=re.findall("<a href=\"(#[^\"]*)\"",content)
        extlinks=re.findall("<a href=\"(http://[^\"]*)\"",content)

        #add https links
        extlinks.extend(re.findall("<a href=\"(https://[^\"]*)\"",content))

        #add ftp links
        extlinks.extend(re.findall("<a href=\"(ftp://[^\"]*)\"",content))

        #add newsgroup links
        extlinks.extend(re.findall("<a href=\"(news://[^\"]*)\"",content))

        #add mail links
        extlinks.extend(re.findall("<a href=\"(mailto:[^\"]*)\"",content))

        relintlinks=re.findall("<a href=\"(/[^\"]*)\"",content)
        print "extandintlinks",len(set(extandintlinks)) #all unique links

        cursor.execute ("UPDATE samplepagescreated2007 SET uniquesumlinks=%s WHERE rev_page = %s",(len(set(extandintlinks)),rev_page,)) #update all
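
The link count itself does not depend on the download step. As a minimal sketch in Python 3, assuming the article content is already available as an HTML string, unique links can be counted as follows. The function name count_unique_links and the sample markup in the usage note are illustrative assumptions; unlike the script above, empty href values are skipped explicitly.

```python
import re

# Captures the target of every anchor tag written as <a href="...">.
HREF_PATTERN = re.compile(r'<a href="([^"]*)"')


def count_unique_links(html):
    """Count distinct link targets, ignoring page-internal (#...) links."""
    targets = HREF_PATTERN.findall(html)
    # A set removes duplicates; targets starting with '#' point into the same page.
    unique_targets = {t for t in targets if t and not t.startswith("#")}
    return len(unique_targets)


if __name__ == "__main__":
    sample = ('<a href="/wiki/A">A</a> <a href="/wiki/A">A</a> '
              '<a href="#top">top</a> <a href="http://example.org/">ext</a>')
    print(count_unique_links(sample))
```

In the sample, the repeated internal link is counted once and the anchor link "#top" is omitted.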

7.2.2.3 Analyses based on the Database

To calculate activity and focus in the analyzed communities, another table was created containing
all user-page combinations in the sample and the number of edits by each user. Because
vandalism was detected in less than 1% of all analyzed communities, potential vandals were not
removed from these analyses, as doing so would have complicated database queries significantly.
Again, bots were excluded:

create table characuser as select up1.rev_page,rev_user_text,`count(*)` from userpages2007 as up1 inner join
samplepagescreated2007 as up2 on up1.rev_page=up2.rev_page where up2.problem=0 and up2.userscalc>1 and
up1.bot=0;

alter table characuser add column edits2007 int unsigned; #edits by each user in 2007 in Wikipedia
alter table characuser add column percentofedits2007 double unsigned; #used to calculate ratio (edits in article/edits2007)

In a first step, the number of edits made in 2007 by each user in the sample (activity) was
calculated with the following function:

# -*- coding: cp1252 -*-


import xml.etree.cElementTree as ElementTree
import datetime
import time
import urllib
import re
import numpy
import os
import pylab
import math
import MySQLdb

def calc_editsuser2007(): #activity
    conn = MySQLdb.connect (host = "127.0.0.1",
                            user = "username",
                            passwd = "password",
                            db = "dbname")
    cursor = conn.cursor (MySQLdb.cursors.DictCursor)
    #all users
    cursor.execute ("select distinct rev_user_text from characuser where edits2007 is Null")
    resultuser=cursor.fetchall()

    for item in resultuser:
        cursor.execute ("select sum(`count(*)`) from userpages2007 where rev_user_text=%s and page_namespace=0",(item["rev_user_text"],))
        result = cursor.fetchone()
        print item["rev_user_text"],result["sum(`count(*)`)"]
        cursor.execute ("UPDATE characuser SET edits2007= %s WHERE rev_user_text = %s",(result["sum(`count(*)`)"],item["rev_user_text"],)) #do for all

In a next step, the ratio of edits in the analyzed article to all edits in Wikipedia articles in 2007
(focus) was calculated for each user-page combination:

# -*- coding: cp1252 -*-


import xml.etree.cElementTree as ElementTree
import datetime
import time
import urllib
import re
import numpy
import os
import pylab
import math
import MySQLdb

def calc_percentofedits2007(): #focus
    conn = MySQLdb.connect (host = "127.0.0.1",
                            user = "username",
                            passwd = "password",
                            db = "dbname")
    cursor = conn.cursor (MySQLdb.cursors.DictCursor)
    #calculate for every user/rev_page combination
    cursor.execute ("select rev_page,rev_user_text,`count(*)`/edits2007 as ratio from characuser where percentofedits2007 is Null;")
    resultcombinations=cursor.fetchall()

    for item in resultcombinations:
        cursor.execute ("UPDATE characuser SET percentofedits2007= %s WHERE rev_user_text = %s and rev_page=%s",(item["ratio"],item["rev_user_text"],item["rev_page"])) #do for all

The average activity and focus of each community was then stored in a temporary table and
updated in the samplepagescreated2007 table:

create temporary table avgfocus_activitytemp as select rev_page, avg(edits2007), avg(percentofedits2007) from characuser group by rev_page order by rev_page asc;

update samplepagescreated2007 as t1, avgfocus_activitytemp as t2 set t1.avgactivity=t2.`avg(edits2007)`, t1.avgfocus=t2.`avg(percentofedits2007)` where t1.rev_page=t2.rev_page;
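
The aggregation performed by these statements can be traced on a small example. The following is a minimal sketch in Python 3 that works on in-memory data instead of the database tables. The function name, the argument names and the sample figures in the usage note are illustrative assumptions; only the definitions of activity (edits per user in 2007) and focus (edits in the article divided by edits in 2007) are taken from the steps above.

```python
from collections import defaultdict


def average_activity_and_focus(user_page_edits, edits_2007):
    """Return {page: (average activity, average focus)}.

    user_page_edits: iterable of (page, user, edits of that user in that page)
    edits_2007: dict mapping each user to all of the user's article edits in 2007
    """
    per_page = defaultdict(list)
    for page, user, edits_in_page in user_page_edits:
        total = edits_2007[user]
        # activity = total edits of the user, focus = share spent on this page
        per_page[page].append((total, edits_in_page / total))

    result = {}
    for page, rows in per_page.items():
        n = len(rows)
        avg_activity = sum(activity for activity, _ in rows) / n
        avg_focus = sum(focus for _, focus in rows) / n
        result[page] = (avg_activity, avg_focus)
    return result


if __name__ == "__main__":
    rows = [(1, "A", 2), (1, "B", 5), (2, "A", 8)]
    totals = {"A": 10, "B": 5}
    print(average_activity_and_focus(rows, totals))
```

In the sample, user A spends a fifth of all edits on page 1 while user B spends all edits there, so page 1 has a higher average focus than the raw edit counts alone would suggest.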

To calculate the heterogeneity of an online community’s users, the number of articles each
user edited in 2007 was determined by adding a column to the characuser table in a first step.

alter table characuser add column articles2007 int unsigned; # nr. of articles edited by each user in 2007

These fields were populated with the help of the following Python script …

# -*- coding: cp1252 -*-


import xml.etree.cElementTree as ElementTree
import datetime
import time

import urllib
import re
import numpy
import os
import pylab
import math
import MySQLdb

def calc_articlesuser2007():
    conn = MySQLdb.connect (host = "127.0.0.1",
                            user = "username",
                            passwd = "password",
                            db = "dbname")
    cursor = conn.cursor (MySQLdb.cursors.DictCursor)
    #get all users
    cursor.execute ("select distinct rev_user_text from characuser where articles2007 is NULL")
    resultuser=cursor.fetchall()

    for item in resultuser:
        cursor.execute ("select count(*) from userpages2007 where rev_user_text=%s and page_namespace=0;",(item["rev_user_text"],)) #get number of articles
        result = cursor.fetchone()
        print item["rev_user_text"],result["count(*)"]
        cursor.execute ("UPDATE characuser SET articles2007= %s WHERE rev_user_text = %s",(result["count(*)"],item["rev_user_text"],)) #update table

…and the average number of articles users in the sample contributed to was calculated.

create temporary table nrofarticles2007 as select distinct rev_user_text,articles2007 from characuser;


select avg(articles2007) from nrofarticles2007;

The resulting figure (265) was used as the number of dimensions: community users were compared
on the 265 most important articles edited by each community, and the average heterogeneity of
each community was calculated with the help of the cosine similarity function:

def calc_heterogeneity(rev_page,vandalismuser):
    pagevektor=[]
    uservektordict={}
    if len(vandalismuser)==0:
        string="''"
    else:
        string=",".join(vandalismuser) #join vandalismuser string (vandals identified)

    conn = MySQLdb.connect (host = "127.0.0.1",
                            user = "username",
                            passwd = "password",
                            db = "dbname")
    cursor = conn.cursor (MySQLdb.cursors.DictCursor)
    dimensions=265 #compare community users on the 265 most important articles of each community
    cursor.execute ("select count(*),up1.rev_page,up1.page_title from userpages2007 as up1 inner join userpages2007 as up2 on up1.rev_user_text=up2.rev_user_text where up2.rev_page=%s and up2.bot=0 and up1.page_namespace=0 and up2.rev_user_text not in ("+string+") group by up1.rev_page Order by `count(*)` desc limit %s;",(rev_page,dimensions))
    resultpages = cursor.fetchall()

    for page2 in resultpages:
        pagevektor.append(page2['rev_page'])

    sortedpagevektor=sorted(pagevektor)
    stringmostcommonsites=",".join([str(el) for el in sortedpagevektor]) #create a vector of the most important articles

    cursor.execute ("select distinct rev_user_text from userpages2007 where bot=0 and page_namespace=0 and rev_page=%s and rev_user_text not in ("+string+");",(rev_page,))
    articleusers = cursor.fetchall() #users of the article without vandals

    usernumber=0
    if len(articleusers)>1:
        for user in articleusers:
            uservektor=numpy.zeros(len(sortedpagevektor),dtype=int) #create vector for each user
            i=0
            usersites=cursor.execute("select * from userpages2007 where rev_user_text=%s and page_namespace=0 and rev_page in ("+stringmostcommonsites+");",(user['rev_user_text'],))
            resultnumber=cursor.fetchall()

            dictforuser={}
            for item in resultnumber:
                dictforuser[item['rev_page']]=item['count(*)'] #populate vector with numbers of edits per article
            for page in sortedpagevektor:
                if dictforuser.has_key(page):
                    uservektor[i]=dictforuser[page]
                i+=1
            uservektordict[usernumber]=uservektor
            usernumber+=1

        mat=numpy.zeros((len(articleusers),len(articleusers))) #create matrix

        usernumber=0
        for user in articleusers:
            usernumber2=0
            for user2 in articleusers:
                #compare each user pair (cosine similarity)
                mat[usernumber,usernumber2]=mat[usernumber2,usernumber]=float(numpy.dot(uservektordict[usernumber],uservektordict[usernumber2])) / (numpy.linalg.norm(uservektordict[usernumber]) * numpy.linalg.norm(uservektordict[usernumber2]))
                usernumber2+=1
            usernumber+=1
        #calculate average heterogeneity (1 - average similarity)
        heterogeneity=1-(mat.sum()-len(articleusers))/(len(articleusers)*(len(articleusers)-1))
    else:
        heterogeneity=0 #if there is only one user heterogeneity =0
    return heterogeneity
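
The measure can also be stated without the database and without numpy. The following is a minimal sketch in Python 3, assuming each user is already represented by a vector of edit counts over the same list of articles. The function names and the sample vectors are illustrative assumptions. Averaging over all unordered user pairs gives the same value as summing the full similarity matrix and subtracting the diagonal, as done above. Treating a user with an all-zero vector as having similarity 0 to everyone is an additional assumption of this sketch, not something the original function does.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors of edit counts."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption of this sketch: an empty profile is similar to no one
    return dot / (norm_u * norm_v)


def heterogeneity(user_vectors):
    """1 minus the average pairwise cosine similarity; 0 for fewer than two users."""
    if len(user_vectors) < 2:
        return 0.0
    pairs = list(combinations(user_vectors, 2))
    average_similarity = sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)
    return 1.0 - average_similarity


if __name__ == "__main__":
    print(heterogeneity([[3, 0, 0], [0, 2, 0]]))  # users with disjoint interests
    print(heterogeneity([[1, 2, 0], [2, 4, 0]]))  # users with proportional profiles
```

Two users who never edit the same articles yield the maximum heterogeneity of 1, while users whose edit profiles are proportional to each other yield a value close to 0.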

To assess the dynamics of collaboration within each article the time difference between each
revision and the following revision was determined and the median calculated.

# -*- coding: cp1252 -*-


import xml.etree.cElementTree as ElementTree
import datetime
import time
import urllib
import re

import numpy
import os
import pylab
import math
import MySQLdb

def calc_mediantimebetweeneditsarticle2007(): #dynamics
    conn = MySQLdb.connect (host = "127.0.0.1",
                            user = "username",
                            passwd = "password",
                            db = "dbname")
    cursor = conn.cursor (MySQLdb.cursors.DictCursor)
    #all pages:
    cursor.execute ("select distinct rev_page from samplepagescreated2007 where problem=0 and userscalc>1 and mediantimebetweenedits is NULL")
    resultpages=cursor.fetchall()
    i=0
    for item in resultpages:
        timediffs=[]
        cursor.execute("SELECT @seq:=0;")
        cursor.execute("drop table if exists revisiontemp2007;")
        #consecutively number revisions in a new table, omit bots
        cursor.execute("create table revisiontemp2007 as SELECT @seq:=@seq+1 as Rank, t1.rev_id, t1.rev_page,t1.rev_user_text,t1.rev_timestamp from revision2007 as t1 where t1.rev_page=%s and t1.rev_user_text not in(select rev_user_text from bots) order by t1.rev_timestamp",(item["rev_page"],))
        cursor.execute("SELECT t1.rank,t2.rank,timestamp(t1.rev_timestamp),timestamp(t2.rev_timestamp),TIMESTAMPDIFF(second,t2.rev_timestamp,t1.rev_timestamp) AS timedif FROM revisiontemp2007 AS t1, revisiontemp2007 AS t2 WHERE t1.rank = t2.rank+1;")
        results=cursor.fetchall()
        for timeitem in results:
            timediffs.append(timeitem["timedif"])
        print item["rev_page"],";",numpy.median(timediffs) #calculate median

        cursor.execute ("UPDATE samplepagescreated2007 SET mediantimebetweenedits= %s where rev_page=%s",(numpy.median(timediffs),item["rev_page"],)) #do for all
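
The same quantity can be computed directly from a list of revision timestamps. The following is a minimal sketch in Python 3 using only the standard library. It assumes timestamps in the format of the XML export (for example 2007-03-01T12:00:00Z) with bot revisions already removed; the function name and the sample values are illustrative assumptions. Returning None for an article with fewer than two revisions is a choice made for this sketch.

```python
from datetime import datetime
from statistics import median

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # format used in the XML export


def median_seconds_between_edits(timestamps):
    """Median gap in seconds between consecutive revisions of one article."""
    ordered = sorted(datetime.strptime(t, TIMESTAMP_FORMAT) for t in timestamps)
    # pair each revision with the one that follows it
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        return None  # fewer than two revisions: no gap can be measured
    return median(gaps)


if __name__ == "__main__":
    revisions = ["2007-01-01T00:00:00Z", "2007-01-01T00:01:00Z",
                 "2007-01-01T00:11:00Z", "2007-01-01T01:11:00Z"]
    print(median_seconds_between_edits(revisions))
```

The median is less sensitive than the mean to a single long pause between two bursts of editing.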

Finally, the control variable article age (in seconds) was calculated with a standard SQL
statement that subtracts the creation date of an article from 01.01.2008 00:00:00.

select rev_page, TIMESTAMPDIFF(second,creationdate,20080101000000) from samplepagescreated2007 where problem=0 and userscalc>1;
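
For illustration, the same subtraction can be written as a minimal Python 3 sketch. The function name and the sample date are assumptions; the cut-off of 01.01.2008 00:00:00 is taken from the statement above.

```python
from datetime import datetime

CUTOFF = datetime(2008, 1, 1, 0, 0, 0)  # end of the observation period


def article_age_seconds(creation_date):
    """Seconds between the creation of an article and the cut-off date."""
    return int((CUTOFF - creation_date).total_seconds())


if __name__ == "__main__":
    print(article_age_seconds(datetime(2007, 12, 31, 0, 0, 0)))
```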


7.3 References used in the Appendix:

[1] Wikipedia 2009, Special:Export, viewed 23.04.2009, <http://en.wikipedia.org/wiki/Special:Export>.

[2] Wikipedia 2009, Virtual community, viewed 23.04.2009, <http://en.wikipedia.org/wiki/Virtual_community>.

[3] Mediawiki Repository 2009, viewed 16.04.2009, <http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/maintenance/tables.sql>.

[4] Wikimedia 2008, dewiki dump progress on 20080607, viewed 16.04.2009, <http://download.wikimedia.org/dewiki/20080607/>.

[5] Mediawiki 2009, MWDumper, viewed 16.04.2009, <http://www.mediawiki.org/w/index.php?title=MWDumper&oldid=242629>.

[6] Wikipedia 2009, Benutzerverzeichnis, viewed 16.04.2009, <http://de.wikipedia.org/w/index.php?title=Spezial%3ABenutzer&username=&group=bot&limit=5000>.

[7] Wikipedia 2009, Namespace, viewed 16.04.2009, <http://en.wikipedia.org/w/index.php?title=Wikipedia:Namespace&oldid=275699788>.

[8] Mediawiki 2009, Identifying a page or revision, viewed 16.04.2009, <http://www.mediawiki.org/wiki/Manual:Parameters_to_index.php#Identifying_a_page_or_revision>.

[9] Gude 2008, wikipedia-article-exporter, viewed 16.10.2008, <http://code.google.com/p/wikipedia-article-exporter/>.

[10] stilversprechend.de 2009, viewed 16.04.2009, <http://www.stilversprechend.de>.
