
FOCUS: ACTIONABLE ANALYTICS

How Robust Is Your Development Team?

Lu Xiao, Zhongyuan Yu, Bohong Chen, and Xiao Wang, Stevens Institute of Technology

IEEE Software | Published by the IEEE Computer Society | 0740-7459/18/$33.00 © 2018 IEEE

// A proposed automatic approach intuitively visualizes development team hierarchy, quantifies overall team robustness, and identifies the point of risk for team robustness. An investigation of six Apache open source projects has shown its effectiveness. //

The development and maintenance of a software system is a collaborative activity. In particular, a complex, large-scale software system requires hundreds of developers contributing simultaneously, adding new features, testing, and fixing bugs. A robust development team is a necessity for the success and on-time delivery of the software product being built.[1] In the event that developers temporarily or permanently leave the team due to various reasons, such as job hopping, retirement, and sick leave,[2] how much could this change potentially disrupt the rest of the team? Ideally, the rest of the team should be able to quickly adjust and make a smooth transition. Team robustness is critical for overall project progress in a commercial environment: the overall progress hopefully won’t be substantially dragged down by team change, and the rest of the team can still stick to the original project schedule. It is equally, if not more, important in an open source environment because of the higher likelihood of team change: contribution to open source projects is purely voluntary.

In this article, we offer an automatic approach to evaluate team robustness based on social network analysis.[3] Our approach is composed of three major steps:

1. mining a project repository to construct a weighted collaboration graph (WCG),
2. developing a novel Two-Layer Collaboration Hierarchy (TLCH) to analyze the collaboration structure hierarchy, and
3. evaluating team robustness by estimating the information loss associated with team members.

We investigated six Apache open source projects, and the results show that our approach greatly promotes the understanding of development team robustness and successfully quantifies the robustness.

Our Approach
Our approach to evaluating development team robustness is composed of the following three parts.

Mining the Weighted Collaboration Graph
First, we mine the project developer team mailing list to construct a WCG. In a WCG, each node represents an individual developer, weighted by the total number of email communications the developer participated in. A node with a high weight indicates a developer who carries a significant share of knowledge of the system. If such a developer leaves the team, it is likely to
significantly affect the rest of the team. An edge between two nodes is weighted by the number of shared email threads between the two associated developers. The edge weight implies the intensity of collaboration between the developers. In a high-intensity-collaboration environment, if a core developer leaves, other developers are highly likely to be impacted directly.

Calculating and Visualizing the Collaboration Hierarchy
Next, we analyze the structure of the WCG. The goal is to understand the hierarchy (if one exists) of the development team. For example, if a single developer has a significant amount of collaboration with others, we consider that he or she is in a higher level of the hierarchy than less-involved developers.
We developed an algorithm to cluster a WCG into a simple TLCH. As the name suggests, the TLCH separates the developers into two layers: the inner layer and the outer layer. The rationale for choosing two layers is to distinguish core developers in a team from the others. We hypothesize that these core developers are the point of risk for the robustness of the entire development team. If they become unavailable, the overall project development will be significantly disrupted. We will provide measurements for estimating the robustness of a team using the TLCH later.
The inner layer contains nodes whose weight is at least one standard deviation above the mean weight of the nodes in the WCG. The other nodes are aggregated into the outer layer. If developers in the inner layer leave the development team, the remaining team will be less stable than if someone in the outer layer leaves. One of the common outlier detection methods, a modified z-score (based on the median absolute deviation),[4] serves as the criterion for a TLCH. If no outlier is detected, the development team cannot be constructed as a TLCH; therefore, there will be only one layer in the WCG, which suggests that the team has a “flat” collaboration property. If one or more outliers are detected for the project, a TLCH will be applied. Teams without a TLCH are considered to be more robust than teams with a TLCH.
In a constructed TLCH, two layers are laid out in two inclusive circles; both the node size and edge thickness indicate the respective weight. The TLCH structure is visualized in Figure 1 and discussed in more detail later. Admittedly, “one standard deviation different” is not the only feasible choice to distinguish core developers from others. Project insiders may have better knowledge to tune this for more meaningful results in the TLCH. We will discuss this as a limitation later.

Estimating the Information Loss Associated with Developers
To quantitatively evaluate the robustness of a development team, we estimate the information loss associated with each developer, especially those developers in the inner layer of the TLCH.
The information loss associated with a developer, IL_dev, is calculated as the percentage of email threads that developer participated in. Intuitively, it reflects the mass of information the developer possesses regarding this project. Therefore, we use it to measure the percentage of information that will potentially get lost with the absence of the developer. The larger the value for IL_dev, the higher the potential disruptions to normal operation of the team caused by that developer’s absence. Formally,

\[ IL_{dev} = \frac{WCG.dev().weight()}{\sum_{i=1}^{N} WCG.dev_i().weight()} . \qquad (1) \]

To further estimate the overall robustness of a team, we calculate the core information loss (CoreIL) as the average information loss of all the inner-layer developers:

\[ CoreIL = \frac{1}{|Inner\_Layer|} \sum_{i \in Inner\_Layer} IL_{dev_i} . \qquad (2) \]

CoreIL measures the potential information loss along with the absence of any developer in the inner layer of the TLCH. The higher this value, the higher the disruption in the event of an inner-layer developer’s absence, and thus the less robust the development team will be. Only the inner-layer developers are considered in the overall robustness measurement, because these developers are the point of risk for team robustness.

Case Study Results
We studied six open source projects using this approach.

Case Study Subjects
The study covers six Apache open source projects: Cassandra, Hadoop, PDFBox, CXF, HBase, and Camel. The basic information for each subject is shown in Table 1: the covered length of history in this study, the total number of developers, and the total number of email communications.

January/February 2018 | IEEE Software

FIGURE 1. Development team hierarchy. (a) Hadoop. (b) Cassandra. (c) CXF. (d) PDFBox. (e) Camel. (f) HBase.
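
The two-layer split behind the hierarchy in Figure 1 can be sketched as follows. This is a hedged illustration, not the authors' algorithm: the function names and the tunable parameter `k` are hypothetical, the population standard deviation is an assumption (the article does not say which estimator is used), and the constant 0.6745 and cutoff 3.5 are the values commonly recommended for the modified z-score rather than values stated in the article. Flagging only unusually high weights is also an assumption of this sketch.

```python
from statistics import mean, median, pstdev


def split_layers(node_weight, k=1.0):
    """Split developers into (inner, outer) by mean + k * standard deviation.

    k=1.0 mirrors the "one standard deviation above the mean" rule.
    An empty inner layer corresponds to a "flat" team.
    """
    weights = list(node_weight.values())
    spread = pstdev(weights) if len(weights) > 1 else 0.0
    if spread == 0:  # no variation in weights, so no hierarchy can be formed
        return [], sorted(node_weight)
    threshold = mean(weights) + k * spread
    inner = sorted(d for d, w in node_weight.items() if w >= threshold)
    outer = sorted(d for d, w in node_weight.items() if w < threshold)
    return inner, outer


def modified_z_outliers(node_weight, cutoff=3.5):
    """Developers whose weight is a high outlier under the modified z-score."""
    weights = list(node_weight.values())
    if not weights:
        return []
    med = median(weights)
    mad = median([abs(w - med) for w in weights])  # median absolute deviation
    if mad == 0:  # score undefined; treat as "no outlier detected"
        return []
    return sorted(
        d for d, w in node_weight.items() if 0.6745 * (w - med) / mad > cutoff
    )


if __name__ == "__main__":
    demo = {"a": 50, "b": 5, "c": 4, "d": 6, "e": 5}
    print(split_layers(demo))
    print(modified_z_outliers(demo))
```

Exposing `k` and `cutoff` as parameters reflects the article's remark that project insiders may want to tune the criterion for their own circumstances.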

IEEE Software | www.computer.org/software | @ieeesoftware
Table 1. Subject projects.

| Subject | Length of history, mm/yy (no. of mos.) | No. of developers | No. of emails |
|---|---|---|---|
| Cassandra | 09/09 to 12/16 (87) | 88 | 1,479 |
| Hadoop | 08/09 to 12/16 (88) | 64 | 18,675 |
| PDFBox | 08/09 to 02/17 (90) | 16 | 12,105 |
| CXF | 12/07 to 02/17 (110) | 43 | 3,971 |
| HBase | 12/09 to 02/17 (86) | 63 | 19,613 |
| Camel | 07/08 to 12/16 (101) | 42 | 17,281 |

Collaboration Hierarchy Visualization
Figure 1 shows the TLCH calculated from each project. (For the sake of readability, each graph only shows up to the top 20 most contributing developers from each project.) In this article, the TLCH for each project is calculated as a static view covering the selected period shown in Table 1. With the TLCH approach applied, all projects except Hadoop contain the two hierarchical layers.
Each calculated TLCH contains two (CXF and PDFBox) to four (Cassandra) developers in the inner layer. These few developers have a significant amount of collaboration among themselves and usually (except for PDFBox) are highly connected with other developers in the team. Thus, they play critical roles in maintaining the stability of team collaboration due to their significant share of knowledge of the system. If they leave, the team is subject to severe information loss. Therefore, these inner-layer developers are the point of risk for the robustness of each team.
In comparison, Hadoop shows a flat team structure. This is because the Hadoop team does not meet the criteria of the TLCH algorithm. Intuitively, this implies that the collaboration among developers is evenly distributed among all the team members. We believe that a flat structure is more robust than the hierarchy structure because no matter who leaves the team, only a small proportion of the project knowledge will be lost. Hence, the whole team is more resilient to risk.

Information Loss
Figure 2 shows the information loss trend across the top four developers in each project. Since the developers in the inner layer of the TLCH are likely to be the point of risk for team robustness, we calculate the CoreIL associated with these developers. Inner-layer developers are highlighted as the thicker dots in Figures 2b to 2f. The CoreIL in the five projects with a hierarchical structure is between 0.346 (CXF in Figure 2c) and 0.473 (PDFBox in Figure 2d). This indicates that if any of the inner-layer developers become unavailable, around 34.6% to 47.3% of the project information might be lost. In particular, PDFBox has the least-robust team according to the data, with a CoreIL of 0.473. And, if the two developers in the inner layer leave, the total information loss is more than 60%. This will cause significant disruptions to the daily operations of the team.
In comparison, since Hadoop has a flat team structure as shown in Figure 1a, we calculate the average IL_dev of the top four developers, which is only 0.187. This implies much more affordable information loss from an individual developer’s absence, compared to the other projects.

Evaluation
We mined the revision history of each project to calculate the actual code contribution percentage made by each core developer (if any core developers exist) to the project code base. If core members contribute the majority of changes to the code base, these core developers are indeed the point of risk for team robustness. In other words, when those core developers become unavailable, daily code revisions will be significantly disrupted. In contrast, if contributions by core members are less significant, the project’s daily operations would not be affected much.
Table 2 lists the evaluation results. The second column shows the total number of code revisions submitted by the entire development team of a project during the studied period. Columns 3 to 6 list the contribution percentage of each inner-layer member in a project. The core members are in the inner layer of the calculated TLCH as shown in Figure 1. The last column shows the total contribution percentage of all the inner-layer members. Since Hadoop has a flat structure, the last row in Table 2 just lists the maximal, average, and standard deviation of the individual developer’s contributions.
We can make the following observations from Table 2. First, the few inner-layer members (up to four


[Figure 2: six plots, panels (a) to (f). x-axis: Top developers (0 to 5); y-axis: Information loss (%) (0 to 100). Panel annotations: (a) Average IL, 0.187; (b) Core member IL, 0.464; (c) Core member IL, 0.346; (d) Core member IL, 0.473; (e) Core member IL, 0.435; (f) Core member IL, 0.395.]

FIGURE 2. Team information loss (IL) with absent developers. (a) Hadoop. (b) Cassandra. (c) CXF. (d) PDFBox. (e) Camel. (f) HBase.
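
The information-loss values annotated in Figure 2 come from the per-developer measure IL_dev and its inner-layer average, CoreIL. A minimal sketch under stated assumptions: the function names are hypothetical, and the denominator follows the prose definition of IL_dev (the share of email threads a developer took part in), so `total_threads` is taken here to be the total number of threads mined. That reading of the denominator is an assumption of this sketch, not something the article spells out.

```python
def information_loss(node_weight, total_threads):
    """IL_dev per developer: fraction of all threads the developer joined.

    `node_weight` maps developer -> number of threads participated in;
    `total_threads` is the assumed total number of threads in the archive.
    """
    if total_threads <= 0:
        raise ValueError("total_threads must be positive")
    return {dev: w / total_threads for dev, w in node_weight.items()}


def core_il(il, inner_layer):
    """CoreIL: mean IL_dev over the inner-layer developers.

    Returns None for a flat team, which has no inner layer to average over.
    """
    if not inner_layer:
        return None
    return sum(il[dev] for dev in inner_layer) / len(inner_layer)


if __name__ == "__main__":
    il = information_loss({"a": 6, "b": 3, "c": 1}, total_threads=10)
    print(il)
    print(core_il(il, ["a", "b"]))
```

A higher CoreIL means that losing an inner-layer developer removes a larger share of project knowledge on average, so lower values indicate a more robust team.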

developers) together make a significant contribution, from 49% (Cassandra) to 61% (PDFBox), to the code base of each project. This implies that when these few developers become unavailable, the daily code revision of the projects will likely be disrupted significantly: who can replace them and make the large percentage of revisions? In particular, the top core member alone contributes 26% (Cassandra) to 49% (PDFBox) of the revisions to the entire code base.
Second, in comparison, in Hadoop, the individual developer usually makes a relatively trivial percentage of revisions. The maximal individual contribution is 5%, and the individual contribution average is only 0.69%, with a standard deviation

of 1%. This implies that Hadoop indeed has the most robust team structure.

Table 2. Core developer contributions.

| Project | No. of revisions | Core 1 (%) | Core 2 (%) | Core 3 (%) | Core 4 (%) | Total (%) |
|---|---|---|---|---|---|---|
| Cassandra | 23,169 | 26 | 12 | 9 | 2 | 49 |
| CXF | 13,393 | 30 | 28 | N/A | N/A | 58 |
| PDFBox | 6,204 | 49 | 12 | N/A | N/A | 61 |
| Camel | 29,166 | 44 | 14 | 2 | N/A | 60 |
| HBase | 13,539 | 36 | 12 | 6 | N/A | 54 |
| Hadoop | 16,196 | Max. deviation = 5%, avg. deviation = 0.69%, and standard deviation = 1% | | | | N/A |

In summary, the data show that the TLCH of the development team and the information loss associated with the top four developers calculated by our approach can faithfully reflect the actual code base contributions of the core developers. Thus, our approach can provide useful insights for evaluating team robustness.

Limitations and Future Work
In this article, we analyzed email data to construct developers’ collaboration links. Developers’ collaboration can take other forms, for example, through bug-tracking systems, shared code ownership, and so on. We acknowledge the limitation of considering only email exchanges. However, our proposed approach can be generalized to collaboration links extracted from other data sources.
When calculating the TLCH, the inner layer is distinguished from the outer layer if the weight of a node is one standard deviation above the mean. “One standard deviation” is not the only feasible approach. We suggest that users tune this value depending on the project circumstances.
We plan to address the following in our future work:

• Here, we apply our approach to the six Apache open source projects. We plan to evaluate and apply this approach on a broader spectrum of projects with more diverse characteristics.
• Currently, the TLCH is constructed based on data for a selected period of time and thus is static. We plan to apply our approach to visualize and monitor the dynamics of the collaboration hierarchy over time, which can provide insights in analyzing a potential increase or decrease of team robustness.
• Although the TLCH algorithm reveals meaningful team structure for the six projects with 16 to 88 developers, we acknowledge that it may not work properly for an ultra-large-scale development team with hundreds of developers or more. But, based on similar rationale, we plan to extend the TLCH to a Multilayer Collaboration Hierarchy to describe the team structure.

Related Work
In the past decades, numerous studies examined the social structure of software development teams using different methods for different goals.[5–10] This section compares this article with the most relevant prior work.
Christian Bird and his colleagues mined the social network from the public email archive of Apache httpd.[5] They found that the messages sent by an individual and the number of source changes that individual makes have a Spearman’s rank correlation of about 0.8. This is consistent with our evaluation results: the top few inner-layer developers in each project contribute a majority of the code revisions. Chris Jensen and Walt Scacchi proposed an “onion” diagram to represent the different roles of open source software developers: active users, developers, project managers, community managers, core developers, passive users, and observers.[7] Andrew Meneely and Laurie Williams found that developer social network measures, such as the edges, social distance, and network centrality, are consistent with developers’ perception of their actual collaborations.[8] Gustavo Oliva and his colleagues reported that only 25% of the developers in a project may be considered as key developers, who are often active in the mailing list and fulfill the coordination requirements.[9] Recently, Mitchell Joblin and his colleagues reported that network metrics are a better data source to capture the core developers in a project, compared


to count-based metrics such as churn (changed lines of code) and commits.[10]
Compared to the work we just mentioned, the uniqueness of this article is that it leverages the analysis of the developer social network for a new angle: analyzing team robustness.

The study results suggest that our approach can effectively help people intuitively understand and quantitatively evaluate the robustness of a development team. Even though our approach has so far been applied only on open source projects, it is directly applicable to commercial projects. Project managers can leverage this approach to periodically monitor the robustness of the development team and identify the points (developers) of risk for team robustness. A potential solution to improve robustness could be regular work rotation or improved collective code ownership.

References
1. B.W. Boehm, “Software Risk Management: Principles and Practices,” IEEE Software, vol. 8, no. 1, 1991, pp. 32–41.
2. A. Cockburn and J. Highsmith, “Agile Software Development, the People Factor,” Computer, vol. 34, no. 11, 2001, pp. 131–133.
3. E. Otte and R. Rousseau, “Social Network Analysis: A Powerful Strategy, Also for the Information Sciences,” J. Information Science, vol. 28, no. 6, 2002, pp. 441–453.
4. B. Iglewicz and D.C. Hoaglin, How to Detect and Handle Outliers, ASQ Press, 1993.
5. C. Bird et al., “Mining Email Social Networks,” Proc. 2006 Int’l Workshop Mining Software Repositories (MSR 06), 2006, pp. 137–143; doi.acm.org/10.1145/1137983.1138016.
6. C. Bird et al., “Latent Social Structure in Open Source Projects,” Proc. 16th ACM SIGSOFT Int’l Symp. Foundations of Software Eng. (SIGSOFT 08/FSE 16), 2008, pp. 24–35; doi.acm.org/10.1145/1453101.1453107.
7. C. Jensen and W. Scacchi, “Role Migration and Advancement Processes in OSSD Projects: A Comparative Case Study,” Proc. 29th Int’l Conf. Software Eng. (ICSE 07), 2007, pp. 364–374; dx.doi.org/10.1109/ICSE.2007.74.
8. A. Meneely and L. Williams, “Socio-technical Developer Networks: Should We Trust Our Measurements?,” Proc. 33rd Int’l Conf. Software Eng. (ICSE 11), 2011, pp. 281–290; doi.acm.org/10.1145/1985793.1985832.
9. G.A. Oliva et al., “Characterizing Key Developers: A Case Study with Apache Ant,” Proc. 18th Int’l Conf. Collaboration and Technology (CRIWG 12), 2012, pp. 97–112; dx.doi.org/10.1007/978-3-642-33284-5_8.
10. M. Joblin et al., “Classifying Developers into Core and Peripheral: An Empirical Study on Count and Network Metrics,” Proc. 2017 IEEE/ACM 39th Int’l Conf. Software Eng. (ICSE 17), 2017, pp. 164–174.

ABOUT THE AUTHORS

LU XIAO is an assistant professor in the Software Engineering Division of the School of Systems and Enterprises at Stevens Institute of Technology. Her research interests include software architecture, software evolution and maintenance, software economics, and software ecosystems. Xiao received a PhD in computer science from Drexel University. Contact her at lxiao6@stevens.edu.

ZHONGYUAN YU is a research assistant professor in Stevens Institute of Technology’s School of Systems and Enterprises. Her research interests include applied statistics, data visualization, simulation, text mining, and socioeconomics to facilitate strategic decision making. Yu received a PhD in system engineering from Stevens Institute of Technology. Contact her at zyu7@stevens.edu.

BOHONG CHEN is a master’s student in business intelligence at Stevens Institute of Technology. His research interests are social network analytics, process optimization, and pattern recognition. Chen received a bachelor of engineering from Shanghai Normal University. Contact him at chenbohong@hotmail.com.

XIAO WANG is a PhD student in Stevens Institute of Technology’s School of Systems and Enterprises. His research interests are software architecture and software evolution and maintenance. He received a master’s in electrical engineering from Stevens Institute of Technology. Contact him at xwang97@stevens.edu.
