
A Study of GitHub as a Collaborative Network

Adam EL BERGUI - mohamed asila


February 9, 2020

1 Introduction
GitHub is the most popular source code hosting and development service. It supports distributed
teams working on open-source software projects of all sizes, and it offers many features, such as
code-related functionality and messaging between users. GitHub hosts more than 20 million
users and more than 57 million repositories. In this work, we study the structure of GitHub as a
collaborative network.

2 Research Questions & Hypothesis


2.1 GitHub as a graph
Users can be related through many forms of interaction. In this work, we study the relationship
between GitHub users through collaboration, modeled as a weighted graph where:
• Nodes are GitHub committers.
• An edge exists between two users if they have committed to the same repository.
• The weight of an edge is the number of collaborations between the two users.
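
The graph definition above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function name `build_collaboration_edges`, the input layout, and the toy data are assumptions, and "number of collaborations" is interpreted here as the number of repositories two users share.

```python
from collections import Counter
from itertools import combinations


def build_collaboration_edges(repo_committers):
    """Return {(user_a, user_b): weight} for an undirected collaboration graph.

    repo_committers maps a repository name to an iterable of committer names.
    Assumption: the weight of an edge is the number of repositories that
    both users committed to.
    """
    weights = Counter()
    for committers in repo_committers.values():
        # Deduplicate and sort so that each pair has one canonical ordering.
        for pair in combinations(sorted(set(committers)), 2):
            weights[pair] += 1
    return dict(weights)


# Hypothetical toy data, not taken from the study.
repos = {
    "repo-1": ["alice", "bob", "carol"],
    "repo-2": ["alice", "bob"],
    "repo-3": ["dave"],
}
edges = build_collaboration_edges(repos)
nodes = {user for committers in repos.values() for user in committers}
```

In this toy example, "dave" is a node without edges because nobody else committed to his repository.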

2.2 Objectives
We aim to study different aspects of the GitHub collaborative network. We state our objectives
as the following questions:
• What is the structure of the GitHub collaborative network?
• Who is the most important user for each programming language?
• What is the most popular programming language?
• Is there any community structure within the GitHub collaborative network?
• How did the GitHub collaborative network evolve from 2017 to 2019?

3 Methodology
3.1 Data Gathering
To get data from GitHub, we use PyGithub, a Python library built on the GitHub REST
API. This library gives access to the information the API exposes about hosted projects.
Because we are interested in the collaboration between committers, we retrieve the
following information:
• The repositories created in 2017, 2018, and 2019 for each programming language.
• The committers of each repository, per programming language, for the last three
years.
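
The two retrieval steps above could look roughly like the following sketch. The helper names `search_query` and `committers_by_repo` and the `max_repos` cap are assumptions made for illustration. The `client` argument is expected to behave like a PyGithub `Github` instance: the sketch relies on its `search_repositories` method, on `Repository.get_commits`, and on the Git author name stored in each commit.

```python
def search_query(language, year):
    """Build a GitHub search string for one language and one creation year."""
    return f"language:{language} created:{year}-01-01..{year}-12-31"


def committers_by_repo(client, language, year, max_repos=100):
    """Map each repository's full name to the set of its committer names.

    Assumption: `client` behaves like a PyGithub `Github` object.
    """
    result = {}
    repos = client.search_repositories(query=search_query(language, year))
    for index, repo in enumerate(repos):
        if index >= max_repos:
            break
        names = set()
        for commit in repo.get_commits():
            # commit.commit.author is the Git author recorded in the commit.
            author = commit.commit.author
            if author is not None and author.name:
                names.add(author.name)
        result[repo.full_name] = names
    return result


# Possible usage (requires PyGithub and a personal access token):
#   from github import Github
#   client = Github("<personal access token>")
#   data = committers_by_repo(client, "python", 2019, max_repos=10)
```

A real crawl would also need to handle API rate limits and repositories without commits, which this sketch leaves out.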

3.2 Data Preprocessing
After gathering the data from GitHub through the REST API, we clean it and
transform it into a collaborative network as described in Section 2.1.

Figure 1: The GitHub collaboration network in 2019

3.3 Algorithms
To find the most important user (committer) for each programming language and the most
popular programming language in terms of commits, we use centrality measures such as:
• Degree Centrality
• Betweenness Centrality
• Closeness Centrality
• Eigenvector Centrality
• PageRank
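
As an illustration, the measures listed above are all available in the NetworkX library. The sketch below assumes NetworkX (with SciPy, which its PageRank implementation needs) is the tool of choice; the source does not say which implementation was used. The function name `most_central_users` and the toy graph are hypothetical.

```python
import networkx as nx


def most_central_users(graph):
    """Return the top-ranked node under each centrality measure."""
    scores = {
        "degree": nx.degree_centrality(graph),
        # Betweenness and closeness are computed on the unweighted graph:
        # NetworkX would read a weight as a distance, whereas here a larger
        # weight means a stronger collaboration tie.
        "betweenness": nx.betweenness_centrality(graph),
        "closeness": nx.closeness_centrality(graph),
        "eigenvector": nx.eigenvector_centrality(
            graph, max_iter=1000, weight="weight"
        ),
        "pagerank": nx.pagerank(graph, weight="weight"),
    }
    return {name: max(values, key=values.get) for name, values in scores.items()}


# Hypothetical toy network: "hub" collaborates with every other user.
G = nx.Graph()
G.add_weighted_edges_from(
    [("hub", "a", 1), ("hub", "b", 1), ("hub", "c", 2), ("a", "b", 1)]
)
top = most_central_users(G)
```

Whether to treat collaboration counts as weights for each measure is a design choice. Here they are used only where a larger weight naturally means more importance.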
To find the communities within this collaborative network, we use community detection
algorithms such as:
• The Louvain algorithm, which aims to maximize the modularity of the partition into communities.
• The label propagation algorithm, which detects communities from the network structure alone,
without prior information about the communities.
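
Both algorithms can be tried with NetworkX, as in the sketch below. It assumes a NetworkX version that ships `louvain_communities`, and the function name `detect_communities` and the toy graph are hypothetical. Each partition is scored with modularity so the two results can be compared.

```python
import networkx as nx
from networkx.algorithms import community


def detect_communities(graph, seed=0):
    """Run Louvain and label propagation and score both with modularity."""
    partitions = {
        "louvain": community.louvain_communities(
            graph, weight="weight", seed=seed
        ),
        "label_propagation": list(
            community.label_propagation_communities(graph)
        ),
    }
    return {
        name: (parts, community.modularity(graph, parts, weight="weight"))
        for name, parts in partitions.items()
    }


# Hypothetical toy network: two 5-cliques joined by a single edge.
G = nx.barbell_graph(5, 0)
results = detect_communities(G)
louvain_parts, louvain_score = results["louvain"]
```

On this toy graph, Louvain separates the two cliques. On real data, the number and size of communities depend on the network and, for Louvain, on the random seed.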

3.4 Results
In this section, we present the results of our work.
RQ1: What is the topology of the GitHub collaborative network?
Motivation: We aim to study the global structure of the collaborative network and determine
whether it has a significant topology that can help us understand how GitHub users collaborate
with each other, both globally and within each programming language.
Approach: We first visualize the GitHub collaborative network for each year (2017, 2018, 2019)
and look for patterns shared by the three networks using macro-level statistics for each
network. We also visualize the GitHub collaborative network for each programming language and
characterize its structure using the macro-level metrics reported in Table 1.
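
The macro-level metrics of Table 1 can be computed with NetworkX along the following lines. This is a sketch under the assumption of a non-empty, undirected graph; the function name `macro_metrics` and the toy graph are hypothetical, and the assortativity computation additionally requires NumPy.

```python
import networkx as nx


def macro_metrics(graph):
    """Compute the macro-level metrics for a non-empty undirected graph."""
    n = graph.number_of_nodes()
    m = graph.number_of_edges()
    largest = max(nx.connected_components(graph), key=len)
    return {
        "nodes": n,
        "edges": m,
        "density": nx.density(graph),  # 2m / (n (n - 1))
        "average_degree": 2 * m / n,
        "average_clustering": nx.average_clustering(graph),
        "degree_assortativity": nx.degree_assortativity_coefficient(graph),
        "largest_component_size": len(largest),
    }


# Hypothetical toy network: a triangle with a pendant node, plus a separate pair.
G = nx.Graph([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("e", "f")])
metrics = macro_metrics(G)
```

As a consistency check on the reported numbers, the 2019 network in Table 1 has 4465 nodes and 113082 edges, so 2 · 113082 / 4465 ≈ 50.65, which matches its reported average degree.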

Figure 2: The Python collaboration network in 2019

Figure 3: The JavaScript collaboration network in 2019

Figure 4: The TypeScript collaboration network in 2019

Results: Figure 1 shows the collaborative network for 2019, and Figures 2, 3, and 4 show the
committer networks for individual programming languages. The networks clearly have a
core-periphery structure: a set of densely interconnected core nodes and a set of peripheral
nodes that are sparsely connected. Using the LapCore algorithm, which detects core-periphery
structure in networks, we found that the collaborative network for each programming language
has a core-periphery topology.
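
LapCore itself is not reproduced here. As a rough stand-in, the sketch below uses k-core decomposition from NetworkX, a different and simpler technique, to separate a densely connected core from the remaining nodes. The function name `split_core_periphery` and the toy graph are hypothetical, and the result should not be expected to match LapCore's output.

```python
import networkx as nx


def split_core_periphery(graph):
    """Label nodes of the innermost k-core as core, all others as periphery.

    This is a k-core proxy, not the LapCore algorithm.
    """
    core_numbers = nx.core_number(graph)
    k_max = max(core_numbers.values())
    core = {node for node, k in core_numbers.items() if k == k_max}
    return core, set(graph) - core


# Hypothetical toy network: a 4-clique where each member has one pendant node.
G = nx.complete_graph(4)
G.add_edges_from((i, f"leaf-{i}") for i in range(4))
core, periphery = split_core_periphery(G)
```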
RQ2: Who is the most important user for each programming language?
Motivation: We aim to find the users who are important in terms of collaboration on GitHub.
This information can be used in a recommendation system to suggest the top contributor, or a
committer who can collaborate with a large team, for a specific language.
Approach: To find the most important committer, we use the centrality measures described in
Section 3.3. We also identify the users who commit in more than one programming language,
which corresponds to an alternative definition of importance: a user who can collaborate
across many programming languages.
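
Counting the languages each user commits in is a simple aggregation. The sketch below assumes the committers have already been grouped by language; the function name `multi_language_users` and the toy data are hypothetical.

```python
from collections import Counter


def multi_language_users(committers_by_language):
    """Return {user: number of languages} for users seen in 2+ languages."""
    counts = Counter()
    for users in committers_by_language.values():
        # set() ensures a user counts once per language.
        counts.update(set(users))
    return {user: n for user, n in counts.most_common() if n > 1}


# Hypothetical toy data, not taken from the study.
by_language = {
    "Python": ["alice", "bob", "alice"],
    "JavaScript": ["alice", "carol"],
    "Shell": ["alice", "bob"],
}
polyglots = multi_language_users(by_language)
```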
Result: The most important users for each programming language under the different centrality
measures are listed in the table below.

Macro-level metrics         Network 2019  Python 2019  JavaScript 2019  TypeScript 2019  Shell 2019
Number of nodes             4465          775          203              1810             541
Number of edges             113082        14211        2499             48750            25466
Density                     0.01135       0.04738      0.12189          0.02978          0.1743
Average degree              50.6526       36.6735      24.6207          53.86            94.14
Average clustering          0.9656        0.9681       0.9781           0.96004          0.9905
Degree assortativity        0.85596       0.971        0.9274           0.9225           0.9992
Size of largest component   3457          216          167              1605             202

Macro-level metrics         C++ 2019      C# 2019      C 2019           Java 2019
Number of nodes             582           40           157              184
Number of edges             9807          158          2227             5486
Density                     0.058         0.202        0.1818           0.3258
Average degree              33.70         7.9          28.36            59.63
Average clustering          0.9721        0.9          0.9913           0.980
Degree assortativity        0.9408        1            0.9712           0.9998
Size of largest component   184           11           70               101

Table 1: Macro-level metrics for the GitHub collaborative network for each programming language

Centrality measures     Network 2019   Python       JavaScript     TypeScript   Shell
Degree Centrality       yangyuanguang  yaoting.gao  yangyuanguang  Adam Vernon  VirtuBox
Betweenness             yangyuanguang  yaoting.gao  yangyuanguang  Adam Vernon  VirtuBox
Closeness               yangyuanguang  yaoting.gao  yangyuanguang  Adam Vernon  VirtuBox
Eigenvector             yangyuanguang  yaoting.gao  yangyuanguang  Adam Vernon  VirtuBox

Users who commit in more than one programming language
User                    Number of programming languages
0xflotus                6
allcontributors[bot]    5
Microsoft Open Source   4
Microsoft GitHub User   4

3.5 Evaluations
We plan to use the modularity score to evaluate the community structure within the GitHub
collaborative network.
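
For reference, the standard definition of modularity for a weighted undirected network is given below. The notation is an assumption, since the source does not define symbols: A_ij is the weight of the edge between users i and j, k_i is the total weight of the edges attached to i, m is the total edge weight, and c_i is the community of i.

```latex
Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j),
\qquad k_i = \sum_j A_{ij}, \qquad m = \frac{1}{2} \sum_{i,j} A_{ij}
```

Here δ(c_i, c_j) equals 1 when i and j belong to the same community and 0 otherwise, so Q compares the weight inside communities with the weight expected if edges were placed at random while preserving node degrees.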

4 Related Works
• Strzalkowski, T., Harrison, T., Sa, N., Katsios, G., Khoja, E. (2018). GitHub as a Social
Network. Advances in Artificial Intelligence, Software and Systems Engineering, 379–390.

• Kalliamvakou, E., Gousios, G., Blincoe, K., Singer, L., German, D. M., Damian, D. (2014).
The promises and perils of mining GitHub. Proceedings of the 11th Working Conference on
Mining Software Repositories - MSR 2014.
• Asri, I., Kerzazi, N., Benhiba, L., Janati, M. (2017). From Periphery to Core: A Temporal
Analysis of GitHub Contributors’ Collaboration Network. IFIP Advances in Information and
Communication Technology, 217–229.
