Exploring homophily

A natural starting point for our study of diversity is to establish the extent to which homophily [32] exists in academia (i.e., whether scientists tend to collaborate more frequently with similar others), which would lead to an overall lack of diversity in scientific collaborations. We use the Microsoft Academic Graph dataset (available at: https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/), and analyze 1,045,401 multi-authored papers (see Supplementary Figure 1 for the distribution of papers by year), written by 1,529,279 scientists, spanning eight main fields and 24 subfields of science. We analyze diversity in terms of five attributes: ethnicity (eth), discipline (dsp), gender (gen), affiliation (aff), and academic age (age); see Supplementary Note 1. The abbreviations in parentheses are used in subsequent mathematical expressions to indicate the associated attribute. These attributes reflect many technical and social factors that influence teamwork and collaboration. Affiliation indicates geographic location, and may even reflect the way collaborative work is carried out, from the style and culture of collaboration to its mundane details, such as the medium used to collaborate (e.g., face-to-face interactions vs. telecommunication or email). Academic age is not only indicative of the amount of experience that a scientist has, but is also typically associated with actual age. Discipline may reflect a scientist’s substantive knowledge and the skills acquired through training, as well as the culture in which collaborative work is carried out. Finally, ethnicity and gender may play a role in shaping scientists’ social identities, knowledge, and biases.
To quantify diversity in terms of any of the aforementioned attributes, we use the Gini impurity [33], resulting in the following group diversity indices: \(d_{{\mathrm {eth}}}^{\mathrm {G}}\), \(d_{\mathrm {{age}}}^{\mathrm {G}}\), \(d_{{\mathrm {gen}}}^{\mathrm {G}}\), \(d_{{\mathrm {dsp}}}^{\mathrm {G}}\) and \(d_{{\mathrm {aff}}}^{\mathrm {G}}\) (an alternative diversity measure was also considered; see Supplementary Note 2 and Supplementary Figure 2).
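
As a minimal sketch of the diversity index, the snippet below computes the Gini impurity of one group in its standard form, 1 − Σ_k p_k², where p_k is the fraction of group members with label k. The formal definition used in the study is given in Supplementary Note 2; the function name, the assumption that each attribute has already been reduced to categorical labels, and the toy author lists are illustrative only.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum_k p_k^2 for the categorical labels of one group."""
    n = len(labels)
    if n == 0:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


# Hypothetical author lists (toy labels, not taken from the dataset).
paper_a = ["eth1", "eth2", "eth3", "eth4"]  # every author differs
paper_b = ["eth1", "eth1", "eth1", "eth1"]  # fully homogeneous

print(gini_impurity(paper_a))  # 0.75
print(gini_impurity(paper_b))  # 0.0
```

Under this form, a fully homogeneous group scores 0, and the maximum for a group of n members with distinct labels is 1 − 1/n.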

To explore homophily, we generate different randomized baseline models whereby a particular attribute—be it ethnicity, gender, affiliation, or academic age—is shuffled. For example, in the case of ethnicity, this process is akin to creating a universe in which ethnicity is disregarded in the selection of co-authors, while retaining other criteria. To preserve the conditional distributions of the ethnicities, the shuffling process is constrained to only occur between authors of papers that have the same subfield, publication year, and number of authors; for full details, see Supplementary Note 3. This way, for every paper p in the real dataset, there exists a matching paper p′ in the randomized dataset that may differ from p in terms of ethnic diversity, but is identical to p in terms of gender, affiliation, academic age, citations, publication year, and number of authors per paper. Importantly, while such a baseline model may produce homogeneous groups, the emergence of such groups is purely the result of random chance rather than homophily. As such, by comparing the real dataset with this baseline model, we can determine whether homophily exists, and if so, quantify the degree to which it is spread across academia. Figure 1a compares our real dataset with the randomized baseline model in terms of the cumulative distributions of \(d_x^{\mathrm {G}}:x \in \{ {\mathrm {eth,age,gen,aff}}\}\). As can be seen, for x ∈ {eth, gen, aff}, groups with low \(d_x^{\mathrm {G}}\) are more common in reality than would be expected by random chance, highlighting the fact that homophily does indeed exist in academia in terms of ethnicity, gender, and affiliation. However, for x = age, the opposite was observed (see Supplementary Figures 3–6 for subfield-specific distributions). These observations persist, regardless of the publication year (Fig. 1b), and the number of authors per paper (Fig. 1c). The temporal trends observed in Fig. 1b are particularly intriguing. 
For \(d_{{\mathrm {eth}}}^{\mathrm {G}}\), while the population of scientists is becoming more ethnically diverse (see the steady increase in the red line), this trend is not reflected in the actual coauthor groupings, implying that ethnic homophily is steadily increasing. For \(d_{{\mathrm {age}}}^{\mathrm {G}}\), the actual level of diversity is greater than would be expected by random chance; this pattern is regularly observed in academia, e.g., consider the many publications resulting from advisor–advisee collaborations. For \(d_{{\mathrm {gen}}}^{\mathrm {G}}\), although gender homophily continues to exist, it steadily decreases over time, suggesting that women are playing an ever greater role in scientific endeavors. Finally, for \(d_{{\mathrm {aff}}}^{\mathrm {G}}\), there is a marked decrease in affiliation homophily around the 1990s; this is consistent with the jump in multi-university collaborations in the 1990s due to the widespread adoption of the Internet and other technologies that facilitate collaboration between geographically distant scientists [30].
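
The constrained shuffling behind the randomized baseline can be sketched as follows. This is only an illustration under assumed conventions, not the procedure of Supplementary Note 3: each record is taken to be one author slot on one paper, the field names (`subfield`, `year`, `n_authors`) are hypothetical, and the chosen attribute is permuted only among slots that share the same subfield, publication year, and number of authors, leaving every other field untouched.

```python
import random
from collections import defaultdict


def shuffle_within_strata(authorships, attribute, seed=0):
    """Return a copy of `authorships` with `attribute` permuted within strata.

    Each authorship is a dict with the (hypothetical) keys 'paper',
    'subfield', 'year', 'n_authors', plus the attribute being shuffled.
    """
    rng = random.Random(seed)

    # Group row indices by (subfield, year, team size).
    strata = defaultdict(list)
    for idx, row in enumerate(authorships):
        key = (row["subfield"], row["year"], row["n_authors"])
        strata[key].append(idx)

    # Copy the rows so that the input is not modified.
    shuffled = [dict(row) for row in authorships]
    for indices in strata.values():
        values = [authorships[i][attribute] for i in indices]
        rng.shuffle(values)  # permute the attribute inside the stratum only
        for i, value in zip(indices, values):
            shuffled[i][attribute] = value
    return shuffled
```

Because only one attribute is permuted, each randomized paper keeps its year, team size, and all other author attributes, so any change in its diversity index comes from the shuffled attribute alone.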

Fig. 1 Exploring homophily in real vs. randomized data. Each column corresponds to a different class of diversity, and each row presents the results of a specific set of experiments whereby \(d_x^{\mathrm {G}}:x \in \{{\mathrm { {eth,age,gen,aff}}}\}\) in real data is compared against randomized data. a Cumulative distributions of \(d_x^{\mathrm {G}}\). b Change in mean diversity \(\langle d_x^{\mathrm {G}}\rangle\) over time. c Mean diversity \(\langle d_x^{\mathrm {G}}\rangle\) for papers with different numbers of authors.

The link between diversity and scientific impact

Having explored homophily in academia, we now study the effects of homophily (and diversity) on research impact, measured by the number of citations received within 5 years of publication, denoted by \(c_5^{\mathrm {G}}\) (see Supplementary Note 4 and Supplementary Figure 7). Using the same dataset and notation described earlier, we study the relationship between a subfield’s diversity and its academic impact. Here, we distinguish between two notions of diversity. The first is where the unit of analysis is a paper’s set of authors, while the second is where the unit of analysis is an individual scientist’s entire set of collaborators. We refer to the former as group diversity, and to the latter as individual diversity; see Fig. 2 for an illustration comparing the two notions.

Fig. 2 Group vs. individual diversity. For any given class of diversity, x ∈ {eth, age, gen, dsp, aff}, differences in color represent differences in terms of x. The group diversity index \(d_x^{\mathrm {G}}\) of Paper A is higher than that of Paper B. The individual diversity index of Scientist C is higher than that of Scientist D.
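
The two units of analysis can be contrasted with a short sketch. It assumes that papers are given as a mapping from a paper identifier to its list of authors, that `attr` maps each scientist to a categorical label, and that a scientist's individual diversity is computed over the set of distinct collaborators, excluding the scientist. This last choice is an assumption made here for illustration; the formal definitions are in Supplementary Note 2.

```python
from collections import Counter, defaultdict


def gini_impurity(labels):
    """Standard Gini impurity, 1 - sum_k p_k^2, of a list of labels."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def group_diversity(papers, attr):
    """Diversity of each paper's own author list."""
    return {
        pid: gini_impurity([attr[a] for a in authors])
        for pid, authors in papers.items()
    }


def individual_diversity(papers, attr):
    """Diversity of each scientist's set of distinct collaborators."""
    collaborators = defaultdict(set)
    for authors in papers.values():
        for a in authors:
            collaborators[a].update(b for b in authors if b != a)
    return {
        a: gini_impurity([attr[b] for b in sorted(cs)])
        for a, cs in collaborators.items()
        if cs  # skip scientists with no collaborators
    }
```

For example, with papers `{"p1": ["a", "b"], "p2": ["a", "c"]}` and labels `{"a": "x", "b": "x", "c": "y"}`, paper p1 has group diversity 0, whereas scientist a, whose collaborators b and c carry different labels, has individual diversity 0.5.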

For each subfield, Fig. 3a depicts the mean group diversity indices, \(\langle d_x^{\mathrm {G}}\rangle :x \in \{ {\mathrm {eth,age,gen,dsp,aff}}\}\), against the mean 5-year citation count, \(\langle c_5^{\mathrm {G}}\rangle\), taken over papers in that subfield (notation summary and formal definitions are in Supplementary Table 1 and Supplementary Note 2, respectively). Remarkably, we find that a subfield’s ethnic diversity is the most strongly correlated with impact (r = 0.77); the positive correlation persists even when the subfields are studied in isolation (Supplementary Figure 8 and Supplementary Table 2), regardless of the number of authors per paper (Supplementary Figure 9). These findings are further supported by the regression analysis in Table 1. While these findings do not imply causation, it is still suggestive that one can largely predict scientific impact based solely on average ethnic diversity, especially given that ethnicity is arguably unrelated to technical competence.

Fig. 3 Group and individual diversity vs. impact in each subfield. In each subplot, the points correspond to subfields, the color indicates the main field, while the solid line and the shaded area represent the regression line and the 95% confidence interval, respectively. Each regression has also been annotated with the corresponding Pearson’s r and p values. a For each subfield, the subplots depict the mean group diversity indices, \(\langle d_{{\mathrm {eth}}}^{\mathrm {G}}\rangle\), \(\langle d_{{\mathrm {age}}}^{\mathrm {G}}\rangle\), \(\langle d_{{\mathrm {gen}}}^{\mathrm {G}}\rangle\), \(\langle d_{\mathrm {{dsp}}}^{\mathrm {G}}\rangle\) and \(\langle d_{\mathrm {{aff}}}^{\mathrm {G}}\rangle\), against the mean 5-year citation count, \(\langle c_5^{\mathrm {G}}\rangle\), taken over papers in that subfield. b For each subfield, the subplots depict the mean individual diversity indices, \(\langle d_{\mathrm {{eth}}}^{\mathrm {I}}\rangle\), \(\langle d_{\mathrm {{age}}}^{\mathrm {I}}\rangle\), \(\langle d_{\mathrm {{gen}}}^{\mathrm {I}}\rangle\), \(\langle d_{\mathrm {{dsp}}}^{\mathrm {I}}\rangle\) and \(\langle d_{\mathrm {{aff}}}^{\mathrm {I}}\rangle\), against the mean 5-year citation count, \(\langle c_5^{\mathrm {I}}\rangle\), taken over scientists in that subfield.

Table 1 Regression analyses of diversity classes on academic impact

Having studied group diversity, we now move our attention to individual diversity. Here, we analyze scientists with at least 10 collaborators each, amounting to a total of 5,103,877 collaborators over 9,472,439 papers (see Supplementary Table 3 for a summary of all filters applied on the dataset). For each subfield, Fig. 3b depicts the mean individual diversity indices, \(\langle d_x^{\mathrm {I}}\rangle :x \in \{ {\mathrm {eth,age,gen,dsp,aff}}\}\), against the mean 5-year citation count, \(\langle c_5^{\mathrm {I}}\rangle\), taken over scientists in that subfield. As can be seen, a subfield’s ethnic diversity is again the most strongly correlated with impact (r = 0.55), even when the subfields are studied in isolation (Supplementary Figure 10 and Supplementary Table 4).

The above results highlight a potential dysfunction. While homophily was observed for ethnicity, affiliation and gender, the only attribute for which it was found to be increasing over time was ethnicity, which seems strange given the apparent preeminence of ethnic diversity. Motivated by this observation, we further explore the relationship between ethnic diversity and scientific impact in the randomized universe used earlier in Fig. 1. Recall that, in such a universe, ethnicity is excluded as a criterion for selecting co-authors while the other factors are preserved. Hence, it stands to reason that any differences in impact between the randomized and real datasets can be attributed to ethnic diversity. To examine these differences, we partitioned the papers into two categories, labeled as diverse \(\left( {d_{\mathrm {{eth}}}^{\mathrm {G}} > \tilde d_{\mathrm {{eth}}}^{\mathrm {G}}} \right)\) and non-diverse \(\left( {d_{\mathrm {{eth}}}^{\mathrm {G}} \le \tilde d_{\mathrm {{eth}}}^{\mathrm {G}}} \right)\), where the tilde denotes the median. The scientists were similarly partitioned into diverse \(\left( {d_{\mathrm {{eth}}}^{\mathrm {I}} > \tilde d_{\mathrm {{eth}}}^{\mathrm {I}}} \right)\) and non-diverse \(\left( {d_{\mathrm {{eth}}}^{\mathrm {I}} \le \tilde d_{\mathrm {{eth}}}^{\mathrm {I}}} \right)\). We find that the diverse consistently outperform the non-diverse, regardless of the year of publication (Fig. 4e), the number of authors per paper (Fig. 4g), and the number of collaborators per scientist (Fig. 4i). We replicated these plots using the randomized, instead of the real, dataset (Fig. 4f, h and j). As can be seen, the performance gap between the diverse and non-diverse almost entirely disappears in the randomized dataset, suggesting that the observed impact gains in the real dataset could indeed be attributed to ethnic diversity.
Note that, in the real dataset, a large proportion of papers have \(d_{\mathrm {{eth}}}^{\mathrm {G}} = 0\) (see Fig. 4a), and a large proportion of scientists have \(d_{\mathrm {{eth}}}^{\mathrm {I}} = 0\) (see Fig. 4c). As such, the observed performance gap between the diverse and the non-diverse could be predominantly due to these papers and scientists being less impactful than their counterparts whose \(d_{\mathrm {{eth}}}^{\mathrm {G}} > 0\) and \(d_{\mathrm {{eth}}}^{\mathrm {I}} > 0\), respectively. To determine whether this is the case, we replicated the analysis of papers but after excluding those with \(d_{\mathrm {{eth}}}^{\mathrm {G}} = 0\), and likewise replicated the analysis of scientists but after excluding those with \(d_{\mathrm {{eth}}}^{\mathrm {I}} = 0\); see Supplementary Figure 11. As can be seen, even after this exclusion, the diverse mostly outperform the non-diverse, regardless of publication year, number of authors per paper, and number of collaborators per scientist.
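
The median split described above can be sketched in a few lines. The function name and the `exclude_zero` flag are illustrative, and recomputing the median after dropping zero-diversity units is an assumption about how the robustness check was carried out, not a detail stated in the text.

```python
from statistics import mean, median


def median_split_means(diversity, impact, exclude_zero=False):
    """Split units at the median diversity and compare mean impact.

    Returns (mean impact of diverse, mean impact of non-diverse), where
    'diverse' means diversity strictly above the median. A side with no
    units is reported as None.
    """
    pairs = list(zip(diversity, impact))
    if exclude_zero:
        # Assumed robustness check: drop zero-diversity units first,
        # then recompute the median on the remaining units.
        pairs = [(d, c) for d, c in pairs if d > 0]
    if not pairs:
        raise ValueError("no units left to split")

    cut = median([d for d, _ in pairs])
    diverse = [c for d, c in pairs if d > cut]
    non_diverse = [c for d, c in pairs if d <= cut]
    return (
        mean(diverse) if diverse else None,
        mean(non_diverse) if non_diverse else None,
    )
```

Applying the same function to the real and to the randomized diversity values, within each publication year or team size, would give the kind of side-by-side comparison shown in Fig. 4e–j.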

Fig. 4 The relationship between ethnic diversity and impact. a Distribution of \(d_{\mathrm {{eth}}}^{\mathrm {G}}\) in real data. Papers were partitioned into two categories: diverse (highlighted in the darker tones, with \(d_{\mathrm {{eth}}}^{\mathrm {G}} > \tilde d_{\mathrm {{eth}}}^{\mathrm {G}}\)) and non-diverse (highlighted in the lighter tones, with \(d_{\mathrm {{eth}}}^{\mathrm {G}} \le \tilde d_{\mathrm {{eth}}}^{\mathrm {G}}\)), where the tilde denotes the median. b The same as (a), but for randomized data. c and d The same as (a, b), respectively, but with \(d_{\mathrm {{eth}}}^{\mathrm {I}}\) instead of \(d_{\mathrm {{eth}}}^{\mathrm {G}}\). e \(\langle c_5^{\mathrm {G}}\rangle\) against publication year in real data. f The same as (e), but for randomized data. g \(\langle c_5^{\mathrm {G}}\rangle\) against number of authors per paper in real data. h The same as (g), but for randomized data. i \(\langle c_5^{\mathrm {I}}\rangle\) against number of collaborators per scientist in real data. j The same as (i), but for randomized data.

Inferring causality

To provide further evidence of the link between ethnic diversity and scientific impact, we use coarsened exact matching [34], a technique typically used to infer causality in observational studies [35]. Specifically, it matches the control and treatment populations with respect to the confounding factors identified, thereby eliminating the effect of these factors on the phenomena under investigation. In our case, when studying group ethnic diversity, the treatment set consists of papers for which \(d_{\mathrm {{eth}}}^{\mathrm {G}} > P_{100 - i}\left( {d_{\mathrm {{eth}}}^{\mathrm {G}}} \right)\), and the control set consists of papers for which \(d_{\mathrm {{eth}}}^{\mathrm {G}} \le P_i\left( {d_{\mathrm {{eth}}}^{\mathrm {G}}} \right)\), where \(P_i\left( {d_{\mathrm {{eth}}}^{\mathrm {G}}} \right)\) denotes the ith percentile of \(d_{\mathrm {{eth}}}^{\mathrm {G}}\). This process is repeated using i = 50, 40, 30, 20, 10, corresponding to progressively larger gaps in ethnic diversity between the two populations. Thus, if ethnic diversity is indeed associated with increased scientific impact, we would expect to find a significant difference in impact between the two populations, and expect this difference to increase in tandem with the aforementioned gap in diversity. The confounding factors identified were the year of publication, number of authors, field of study, authors’ impact prior to publication, and university ranking. The same process was carried out for individual ethnic diversity, for which the confounding factors were academic age, number of collaborators, discipline, and university ranking; see Supplementary Note 5 and Supplementary Figures 12 and 13 for more details, and Supplementary Figure 14 for an illustration of how this process works on a given collection of papers. The results for group and individual ethnic diversities are summarized in Tables 2 and 3, respectively.
As can be seen, increasing the diversity gap between the control and treatment populations is often accompanied by a greater difference in scientific impact between the two populations. Remarkably, in the case of papers and scientists above the 90th percentile, the difference in scientific impact reaches 10.63% and 47.67%, respectively, compared to their counterparts below the 10th percentile. Clearly, these results do not suggest that diversity is the only causal factor. For example, one may argue that highly ranked universities tend to attract students from around the world and are more ethnically diverse as a result; indeed, we verified that this was the case (see Supplementary Note 6 and Supplementary Figures 15 and 16). In such situations, coarsened exact matching is particularly useful precisely because it allows us to establish causality despite such effects.
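
To make the matching step concrete, the sketch below runs a simplified coarsened exact matching for a single value of i. It is an illustration under stated assumptions rather than the procedure of Supplementary Note 5: percentiles use the nearest-rank convention, the caller supplies a hypothetical `coarsen` function that maps each unit to a stratum of binned confounders, strata lacking either treated or control units are discarded, and the effect is the average within-stratum difference in mean impact, weighted by the number of treated units.

```python
import math
from collections import defaultdict
from statistics import mean


def percentile(values, q):
    """Nearest-rank q-th percentile (one of several common conventions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def cem_effect(units, i, coarsen):
    """Estimate the treated-minus-control impact difference after matching.

    units:   list of dicts with keys 'diversity' and 'impact', plus any
             confounders that `coarsen` reads (hypothetical schema).
    coarsen: function mapping a unit to a hashable stratum key.
    Returns None if no stratum contains both treated and control units.
    """
    d = [u["diversity"] for u in units]
    low, high = percentile(d, i), percentile(d, 100 - i)

    strata = defaultdict(lambda: {"treated": [], "control": []})
    for u in units:
        if u["diversity"] > high:
            strata[coarsen(u)]["treated"].append(u["impact"])
        elif u["diversity"] <= low:
            strata[coarsen(u)]["control"].append(u["impact"])

    # Keep only strata in which treated units have matching controls.
    matched = [s for s in strata.values() if s["treated"] and s["control"]]
    if not matched:
        return None

    n_treated = sum(len(s["treated"]) for s in matched)
    weighted = sum(
        len(s["treated"]) * (mean(s["treated"]) - mean(s["control"]))
        for s in matched
    )
    return weighted / n_treated
```

A caller might coarsen publication year into five-year bins and match team size exactly, e.g. `coarsen = lambda u: (u["year"] // 5, u["n_authors"])`; the bin widths are a modeling choice that trades match quality against the number of units retained.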

Table 2 Coarsened exact matching of group ethnic diversity

Table 3 Coarsened exact matching of individual ethnic diversity

Interplay between group and individual ethnic diversity

Finally, we investigate the interplay between group ethnic diversity, \(d_{\mathrm {{eth}}}^{\mathrm {G}}\), and individual ethnic diversity, \(d_{\mathrm {{eth}}}^{\mathrm {I}}\). To this end, for each of the 1,045,401 papers in our dataset, we calculate \(d_{\mathrm {{eth}}}^{\mathrm {I}}\) averaged over the authors in that paper; we denote this as \(\left\langle {d_{\mathrm {{eth}}}^{\mathrm {I}}} \right\rangle _{\mathrm {{paper}}}\). This allows us to study the ways in which the two notions of diversity vary in the same paper. Indeed, as illustrated in Fig. 5, a paper can have high \(d_{\mathrm {{eth}}}^{\mathrm {G}}\) and at the same time have low \(\left\langle {d_{\mathrm {{eth}}}^{\mathrm {I}}} \right\rangle _{\mathrm {{paper}}}\), and vice versa. With this in mind, we studied the impact, \(\left\langle {c_5^{\mathrm {G}}} \right\rangle\), of papers falling in different ranges of \(d_{\mathrm {{eth}}}^{\mathrm {G}}\) and \(\left\langle {d_{\mathrm {{eth}}}^{\mathrm {I}}} \right\rangle _{\mathrm {{paper}}}\); see the matrix at the bottom-right corner of Fig. 5. Here, if we denote this matrix by A, and label the bottom row and leftmost column as 1, we find that \(\mathop {\sum}\nolimits_{i = 1}^4 A_{i,1} < \mathop {\sum}\nolimits_{i = 1}^4 A_{1,i}\) and \(\mathop {\sum}\nolimits_{i = 1}^4 A_{i,4} > \mathop {\sum}\nolimits_{i = 1}^4 A_{4,i}\). Hence, while it appears that both group and individual diversities can be valuable, the former seems to have a greater effect on scientific impact. In other words, having co-authors who are inclined to collaborate across ethnic lines (i.e., co-authors whose individual ethnic diversity is high) appears to be not as important as the mere presence of co-authors of different ethnicities (i.e., co-authors whose group ethnic diversity is high).
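
The construction of the impact matrix can be sketched as follows. The bin edges, the function names, and the choice to index rows by the paper-averaged individual diversity and columns by the group diversity are all assumptions made for illustration; the text does not specify the ranges used or the axis layout of the matrix in Fig. 5.

```python
from bisect import bisect_right
from statistics import mean


def paper_mean_individual(authors, d_individual):
    """Average the authors' individual diversity indices for one paper."""
    return mean(d_individual[a] for a in authors)


def impact_grid(records, edges):
    """Mean impact of papers binned by two diversity measures.

    records: iterable of (d_group, d_individual_paper, citations) tuples.
    edges:   sorted interior bin edges; three edges give a 4 x 4 grid.
    Returns grid[i][j], the mean citations of papers whose individual
    measure falls in bin i (row) and whose group measure falls in bin j
    (column), or None for an empty cell. This layout is an assumption.
    """
    k = len(edges) + 1
    cells = [[[] for _ in range(k)] for _ in range(k)]
    for d_group, d_ind, cites in records:
        i = bisect_right(edges, d_ind)    # row index, 0 = lowest bin
        j = bisect_right(edges, d_group)  # column index, 0 = lowest bin
        cells[i][j].append(cites)
    return [[mean(c) if c else None for c in row] for row in cells]
```

Given such a grid, the comparisons in the text amount to summing the first (or last) row and the first (or last) column and checking which total is larger.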