Data

Our open source projects dataset was collected from GitHub39, a social coding platform that provides source code management and collaboration features such as bug tracking, feature requests, task management and a wiki for every project. Given that GitHub users can star a project (to show interest in its development and follow its advances), we chose to measure the popularity of a GitHub project by its number of stars (i.e. the more stars, the more popular the project is considered) and selected the 100 most popular projects. Other possible criteria (number of forks, open issues, watchers, commits and branches) are positively correlated with stars17, so our proxy for mature, successful and active projects probably overlaps with other sampling procedures. The construction of the dataset involved three phases, namely: (1) cloning, (2) import, and (3) enrichment.

Cloning and import

After collecting the list of the 100 most popular projects on GitHub (at the moment of data collection) via its API40, we cloned them to obtain 100 Git repositories. We analysed the cloned repositories and discarded those that did not involve the development of a software artifact (e.g. collections of links or questions), rejecting 15 of the initial 100 projects. We then imported the remaining Git repositories into a relational database using the Gitana41 tool to facilitate querying and exploring the projects for further analysis. In the Gitana database, Git repositories are represented in terms of users (i.e. contributors with a name and an email); files; commits (i.e. changes performed to the files); references (i.e. branches and tags); and file modifications. For two projects, the import process failed to complete due to missing or corrupted information in the source GitHub repository.

Enrichment

Our analysis needs a clear identification of the author of each commit so that we can properly link contributors to the files they have modified. Unfortunately, Git does not control the name and email that contributors indicate when pushing commits, resulting in clashing and duplicity problems in the data. Clashing appears when two or more contributors have set the same name value (in Git the contributor name is manually configured), so that commits actually coming from different contributors appear under the same name (e.g., a common name such as “mike”). Duplicity appears when a single contributor uses several emails: commits coming from the same person are linked to different emails, suggesting different contributors. We found that, on average, around 60% of the commits in each project were made by contributors involved in a clashing/duplicity problem (with a similar fraction of files affected). To address this problem, we relied on data provided by GitHub for each project, in particular GitHub usernames, which are unique. By linking commits to unique usernames, we could disambiguate the contributors behind the commits. Thus, we enriched our repository data by querying the GitHub API to discover the actual username behind each commit, and relied on those usernames instead of the information provided in the Git commit metadata. This method only failed for commits without an associated GitHub username (e.g. when the user that made the commit no longer existed on GitHub); in those cases, we kept the email in the Git commit as the contributor identifier. This enrichment considerably reduced the clashing/duplicity problem in our dataset: the percentage of commits affected dropped to 0.004% on average (σ = 0.011), and the percentage of files affected to 0.020% (σ = 0.042).
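The disambiguation step can be sketched as follows. This is an illustrative toy, not the actual pipeline: function, field and variable names are ours. Commits are mapped to unique GitHub usernames when the API returns one, with the Git email as a fallback.

```python
def resolve_author(commit, username_by_sha):
    """Return a stable contributor identifier for a commit."""
    username = username_by_sha.get(commit["sha"])  # looked up via the GitHub API
    if username:                    # a unique GitHub username exists
        return username
    return commit["email"]          # fallback: the raw Git commit email

# Three commits sharing the clashing name "mike": two belong to the same
# person (two different emails), the third has no GitHub account any more.
commits = [
    {"sha": "a1", "name": "mike", "email": "mike@corp.com"},
    {"sha": "a2", "name": "mike", "email": "m.jones@gmail.com"},
    {"sha": "a3", "name": "mike", "email": "old@defunct.org"},
]
username_by_sha = {"a1": "mjones", "a2": "mjones"}  # hypothetical API result

authors = [resolve_author(c, username_by_sha) for c in commits]
# authors == ["mjones", "mjones", "old@defunct.org"]
```

The first two commits collapse into a single contributor, resolving both the clash and the duplicity at once; the third keeps its email as identifier.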

At the end of this process, we had successfully collected a total of 83 projects, adding up to 48,015 contributors, 668,283 files and 912,766 commits. Eighteen more projects were rejected due to other limitations, leaving the 65 reported in this work. First, we discarded projects presenting a very strong divergence between the sizes of the two node sets, e.g. projects with a very large number of files but very few contributors. In these cases, although \({\mathscr{N}}\), Q and \( {\mathcal I} \) can be quantified, the outcome is hardly interpretable. An example is the project material-designs-icons, with 15 contributors involved in the development of 12,651 files. As mentioned above, we also discarded projects that are not devoted to software development, but are rather collections of useful resources (free programming books, coding courses, etc.). Finally, we considered only projects with a bipartite network size within the range \({10}^{1}\le S\le {10}^{4}\), as the computational costs of optimising in-block nestedness and modularity for larger sizes were too severe. The complete dataset with the final 65 projects is available at http://cosin3.rdi.uoc.edu, under the Resources section.

Matrix generation

We build a bipartite unweighted network as a rectangular N × M matrix, where rows and columns refer to the contributors and source files of an OSS project, respectively. Cells represent links in the bipartite network: if the cell \({a}_{ij}\) has a value of 1, contributor i has modified file j at least once; otherwise \({a}_{ij}\) is set to 0.
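As an illustration, the matrix construction can be sketched in a few lines of Python (the helper below is our own, not part of the paper's tooling); repeated modifications of the same file by the same contributor still yield a single unweighted link:

```python
import numpy as np

def build_matrix(records):
    """records: iterable of (contributor, file) pairs, one per modification."""
    contributors = sorted({c for c, _ in records})
    files = sorted({f for _, f in records})
    row = {c: i for i, c in enumerate(contributors)}
    col = {f: j for j, f in enumerate(files)}
    A = np.zeros((len(contributors), len(files)), dtype=int)
    for c, f in records:
        A[row[c], col[f]] = 1   # unweighted: repeated edits still give 1
    return A, contributors, files

records = [("alice", "main.c"), ("alice", "util.c"),
           ("bob", "util.c"), ("alice", "util.c")]   # one duplicate edit
A, rows, cols = build_matrix(records)
# A == [[1, 1],
#       [0, 1]]
```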

We are aware that an unweighted scheme may discard important information, i.e. the heterogeneity of the time and effort that developers devote to files. We stress, however, that including weights in our analysis could introduce ambiguities in our results. In the GitHub environment, the size of a contribution could be regarded either as the number of times a developer commits to a file, or as the number of lines of code (LOC) that a developer modified when updating the file. Indeed, both could represent additional dimensions of our study, and at least the first (number of commits) is readily available from our data collection methods. However, weighting the links of the network by the number of commits is risky. Consider, for example, a contributor who, after hours or days of coding and testing, performs a single commit that substantially changes a file in a project. On the other hand, consider a contributor who is simply documenting some code, thus committing many small comments to existing software without changing its internal logic. There is no simple way to distinguish these cases. The second option (number of LOC modified) could be a proxy for such a distinction, but this information is not realistically accessible given the current limitations on data collection. Obtaining a precise number of LOC requires a deeper analysis of the Git repository associated with the GitHub project, parsing the change information of each commit one by one, an unfeasible task if we aim at analysing a large set of projects. The same scalability issue would appear if we relied on the GitHub API to get this information, which would additionally run into the quota limits of that API. One might even consider a third argument: not every programming language “weighs” contributions in the same way. Many lines of HTML code may have a small effect on the actual advancement of a project, while two brief lines of C may completely change a whole algorithm.
In conclusion, we believe there is no generic solution that allows one to assess the importance of a LOC variation in a contribution. Such an assessment would depend first on the kind of file, then on the programming style of each project, and finally on an individual analysis of each change. Thus, adding informative and reliable weights to the network is semantically unclear (how should we interpret those weights?) and operationally out of reach.

Nestedness

The concept of nestedness appeared, in the context of complex networks, over a decade ago in systems ecology42. In structural terms, a perfectly nested pattern is observed when specialists (nodes with low connectivity) interact with proper nested subsets of the species interacting with generalists (nodes with high connectivity), see Fig. 2 (left). Several works have shown that a nested configuration is a signature feature of cooperative environments, i.e. those in which interacting species obtain some benefit42,43,44. Following this example from natural systems, scholars have sought (and found) this pattern in other kinds of systems32,45,46,47. In particular, measuring nestedness in OSS contributor-file bipartite networks helps to uncover patterns of file development. For instance, in a perfectly nested bipartite network the most generalist developer has contributed to every file in the project, i.e. a core developer. Other contributors exhibit decreasing amounts of edited files. On top of this hierarchical arrangement, we find asymmetry: specialist contributors (those working on a single file) develop precisely the generalist file, i.e. the file that every other developer also works on. Here, we quantify the amount of nestedness in our OSS networks by employing the global nestedness fitness \({\mathscr{N}}\) introduced by Solé-Ribalta et al.30:

$${\mathscr{N}}=\frac{2}{N+M}\{\mathop{\sum }\limits_{i,j}^{N}\,[\frac{{O}_{i,j}-\langle {O}_{i,j}\rangle }{{k}_{j}(N-1)}\Theta ({k}_{i}-{k}_{j})]+\mathop{\sum }\limits_{l,m}^{M}\,[\frac{{O}_{l,m}-\langle {O}_{l,m}\rangle }{{k}_{m}(M-1)}\Theta ({k}_{l}-{k}_{m})]\},$$ (1)

where \({O}_{i,j}\) (or \({O}_{l,m}\)) measures the degree of link overlap between pairs of row (or column) nodes; \({k}_{i}\), \({k}_{j}\) correspond to the degrees of nodes i and j; and Θ(·) is the Heaviside step function, which guarantees that we only compute the overlap between a pair of nodes when \({k}_{i}\ge {k}_{j}\). Finally, \(\langle {O}_{i,j}\rangle \) represents the expected overlap between row nodes i and j under the null model, and is equal to \(\langle {O}_{i,j}\rangle =\frac{{k}_{i}{k}_{j}}{M}\). This measure is in the tradition of other overlap measures, such as NODF48,49.
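For concreteness, a direct (unoptimised) transcription of Eq. (1) might look as follows. It assumes one common reading of the formula: the sums run over ordered pairs i ≠ j, ties are counted (Θ(0) = 1), and the null model is \(\langle {O}_{i,j}\rangle ={k}_{i}{k}_{j}/M\) for rows (\({k}_{l}{k}_{m}/N\) for columns). It is a sketch, not the reference implementation distributed by the authors.

```python
import numpy as np

def nestedness(A):
    """Global nestedness fitness of Eq. (1) for a binary bipartite matrix."""
    A = np.asarray(A, dtype=float)
    N, M = A.shape
    total = 0.0
    for B, n, m in ((A, N, M), (A.T, M, N)):   # row pass, then column pass
        k = B.sum(axis=1)                      # node degrees
        O = B @ B.T                            # pairwise link overlaps
        for i in range(n):
            for j in range(n):
                if i != j and k[i] >= k[j] and k[j] > 0:
                    null = k[i] * k[j] / m     # expected overlap <O_ij>
                    total += (O[i, j] - null) / (k[j] * (n - 1))
    return 2.0 / (N + M) * total

# A perfectly nested 3 x 3 matrix:
A = [[1, 1, 1],
     [1, 1, 0],
     [1, 0, 0]]
value = nestedness(A)   # = 1/9 under the conventions above
```

Note that even a perfectly nested matrix does not reach 1, because the null-model term \(\langle {O}_{i,j}\rangle \) is subtracted from each overlap.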

Modularity

A modular network structure (Fig. 2, center) implies the existence of well-connected subgroups, which can be identified given the right heuristics. Unlike nestedness (which apparently emerges only in very specific circumstances), modularity has been reported in almost every kind of system: from food webs50 to lexical networks51, the Internet27 and social networks52. Applied to OSS developer-file networks, modularity helps to identify blocks of developers working together on a set of files. High Q values in OSS projects would reveal some level of specialisation (division of labour) in the development of the project. However, if an OSS project is only modular (i.e., any trace of nestedness is missing), it may reveal that, beyond compartmentalisation, no further organisational refinement is at work. Here, we search for a (sub)optimal modular partition of the nodes through community detection analysis26,27. To this end, we apply the extremal optimisation algorithm53 (along with a Kernighan-Lin54 refinement procedure) to maximise Barber’s26 modularity Q,

$$Q=\frac{1}{L}\mathop{\sum }\limits_{i=1}^{N}\,\mathop{\sum }\limits_{j=N+1}^{N+M}\,({\tilde{a}}_{ij}-{\tilde{p}}_{ij})\,\delta ({\alpha }_{i},{\alpha }_{j})$$ (2)

where L is the number of interactions (links) in the network, \({\tilde{a}}_{ij}\) denotes the existence of a link between nodes i and j, \({\tilde{p}}_{ij}={k}_{i}{k}_{j}/L\) is the probability that such a link exists by chance, and \(\delta ({\alpha }_{i},{\alpha }_{j})\) is the Kronecker delta function, which takes the value 1 if nodes i and j are in the same community, and 0 otherwise.
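Given a fixed assignment of nodes to blocks, Eq. (2) itself is straightforward to evaluate. The sketch below (our own helper, separate from the extremal-optimisation search that actually finds the partition) computes Q for a bipartite adjacency matrix:

```python
import numpy as np

def barber_modularity(A, row_labels, col_labels):
    """Barber's bipartite modularity Q of Eq. (2) for a given partition."""
    A = np.asarray(A, dtype=float)
    L = A.sum()                                 # number of links
    kr = A.sum(axis=1)                          # row (contributor) degrees
    kc = A.sum(axis=0)                          # column (file) degrees
    P = np.outer(kr, kc) / L                    # null model p~_ij = k_i k_j / L
    # Kronecker delta: True where a row node and a column node share a block
    same = np.equal.outer(np.asarray(row_labels), np.asarray(col_labels))
    return ((A - P) * same).sum() / L

# Two disconnected blocks of developers/files -> maximal two-module Q:
A = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
Q = barber_modularity(A, [0, 0, 1, 1], [0, 0, 1, 1])   # = 0.5
```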

In-block nestedness

Nestedness and modularity are emergent properties in many systems, but it is rare to find both in the same system. This apparent incompatibility has been noticed and studied, and it can be explained by different evolutionary pressures: certain mechanisms favour the emergence of blocks, while others favour the emergence of nested patterns. Following this logic, if two such mechanisms act concurrently, hybrid (nested-modular) arrangements may appear. Hence, the third architectural organisation that we consider in our work is a hybrid mesoscale pattern in which the network presents a modular structure, but the interactions within each module are nested, i.e. an in-block nested structure, see Fig. 2 (right). This type of hybrid or “compound” architecture was first described in Lewinsohn et al.28. Although the literature covering these patterns is still scarce, their existence in empirical networks has been recently explored29,30,55, and the results from these works indicate that combined structures are, in fact, a common feature of systems from many different contexts.

In order to compute the amount of in-block nestedness present in networks, we have adopted a recently introduced objective function30 that is capable of detecting these hybrid architectures, and employed the same optimisation algorithms used to maximise modularity. The in-block nestedness objective function can be written as,

$$ {\mathcal I} =\frac{2}{N+M}\{\mathop{\sum }\limits_{i,j}^{N}\,[\frac{{O}_{i,j}-\langle {O}_{i,j}\rangle }{{k}_{j}({C}_{i}-1)}\Theta ({k}_{i}-{k}_{j})\,\delta ({\alpha }_{i},{\alpha }_{j})]+\mathop{\sum }\limits_{l,m}^{M}\,[\frac{{O}_{l,m}-\langle {O}_{l,m}\rangle }{{k}_{m}({C}_{l}-1)}\Theta ({k}_{l}-{k}_{m})\,\delta ({\alpha }_{l},{\alpha }_{m})]\},$$ (3)

Note that, by definition, \( {\mathcal I} \) reduces to \({\mathscr{N}}\) when the number of blocks is 1; here \({C}_{i}\) (\({C}_{l}\)) denotes the number of row (column) nodes in the block of the corresponding node, and the Kronecker deltas restrict the sums to pairs of nodes within the same block. This explains why the right half of the ternary plot (Fig. 6) is necessarily empty: \( {\mathcal I} \ge {\mathscr{N}}\), and therefore \({f}_{ {\mathcal I} }\ge {f}_{{\mathscr{N}}}\). On the other hand, an in-block nested structure necessarily exhibits some level of modularity, but not the other way around. This explains why the lower-left area of the simplex in Fig. 6 is empty as well (see Palazzi et al.33 for details).
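Under the same reading of the symbols as in the sketch for Eq. (1) (sums over ordered pairs i ≠ j, Θ(0) = 1, null model \({k}_{i}{k}_{j}/M\) for rows, and \({C}_{i}\) taken as the number of same-side nodes in the block of node i), Eq. (3) can be transcribed directly. This is an illustrative toy, not the optimised code distributed by the authors:

```python
import numpy as np

def in_block_nestedness(A, row_labels, col_labels):
    """In-block nestedness I of Eq. (3) for a given partition into blocks."""
    A = np.asarray(A, dtype=float)
    N, M = A.shape
    total = 0.0
    for B, labels, m in ((A, np.asarray(row_labels), M),
                         (A.T, np.asarray(col_labels), N)):
        n = B.shape[0]
        k = B.sum(axis=1)                      # node degrees
        O = B @ B.T                            # pairwise link overlaps
        # C[i]: number of same-side nodes in the block of node i
        C = np.array([(labels == labels[i]).sum() for i in range(n)])
        for i in range(n):
            for j in range(n):
                if (i != j and labels[i] == labels[j]
                        and k[i] >= k[j] and k[j] > 0 and C[i] > 1):
                    null = k[i] * k[j] / m     # expected overlap <O_ij>
                    total += (O[i, j] - null) / (k[j] * (C[i] - 1))
    return 2.0 / (N + M) * total

# With a single block, I reduces to the global nestedness N, as stated above:
A = [[1, 1, 1],
     [1, 1, 0],
     [1, 0, 0]]
I_one_block = in_block_nestedness(A, [0, 0, 0], [0, 0, 0])   # equals N(A)
```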

The corresponding software codes for nestedness measurement, and modularity and in-block nestedness optimisation (both for uni- and bipartite cases), can be downloaded from the web page http://cosin3.rdi.uoc.edu/, under the Resources section.

Stationarity test

Figures 1 and 5 visually suggest that some quantities do not vary as a function of project size, or vary very slowly. As convincing as this visual hint may be, a statistical test is necessary to confirm that there is indeed a limit to the quantity at stake. Stationarity of a time series implies that summary statistics of the data, like the mean or variance, are approximately constant when measured from any two starting points in the series (different project sizes in our case). Typically, statistical stationarity is tested by checking for the presence (or absence) of a unit root in the time series (the null hypothesis). A time series is said to have a unit root if we can write it as

$${y}_{t}={a}^{n}{y}_{t-n}+\mathop{\sum }\limits_{i=0}^{n-1}\,{a}^{i}{\varepsilon }_{t-i}$$ (4)

where ε is an error term. If a = 1, the null hypothesis of non-stationarity cannot be rejected. If, on the contrary, a < 1, there is no unit root and the process is deemed stationary. In this work, we have employed the Augmented Dickey-Fuller (ADF) test56, as implemented in the statsmodels.tsa.stattools module of the statsmodels Python package. If the test statistic is smaller than the critical value at a given significance level, the null hypothesis of a unit root is rejected, and we can conclude that the data series is stationary.