Leading countries in global science increasingly receive more citations than other countries doing similar research

To capture citational lensing, we represent science as a multiplex network, \(\mathbf{L}\), with three layers (Fig. 1). Consider the simple case of citational lensing in a single field in a given year t. \({\mathbf{L}}_{\mathrm{citation}}\) is the citation network between countries, where \({\mathbf{L}}_{{\mathrm{citation}}_{i,j}}\) contains the citation flow from country i to country j. To make things comparable across the layers of the multiplex network, \({\mathbf{L}}_{\mathrm{citation}}\) is constructed as the number of citations received by country i’s papers published in the given field in year t from all other countries j, where we use a five-year window after publication year t to capture all citations from countries j from year t to year t + 5. In that way, the text network based on published papers in year t corresponds to the citation network of cumulative citations received over the ensuing five years by papers published in year t. We use z-scores for the edge weights rather than the raw citation counts themselves.
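
For illustration, a minimal sketch of how one field-year citation layer might be assembled from a table of citation records is given below; the pandas-based approach and the column names (citing_country, cited_country, citing_year, cited_year) are assumptions for the example rather than the exact pipeline used here.

```python
import pandas as pd

def build_citation_layer(citations: pd.DataFrame, year: int, window: int = 5) -> pd.DataFrame:
    """Sketch: z-scored citation layer for papers published in `year`.

    `citations` is assumed to hold one row per citation with columns
    'citing_country', 'cited_country', 'citing_year' and 'cited_year'
    (hypothetical schema for illustration only).
    """
    # Keep citations to papers published in `year`, received from year t to t + window
    mask = (citations['cited_year'] == year) & \
           (citations['citing_year'].between(year, year + window))
    window_cites = citations[mask]

    # Raw flow: number of citations from country i (citing) to country j (cited)
    flows = (window_cites
             .groupby(['citing_country', 'cited_country'])
             .size()
             .rename('n_citations')
             .reset_index())

    # z-score the edge weights so the layer is comparable to the text layer
    flows['weight'] = (flows['n_citations'] - flows['n_citations'].mean()) / flows['n_citations'].std()
    return flows
```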

Another layer, \({\mathbf{L}}_{\mathrm{text}}\), is a network where each connection \({\mathbf{L}}_{{\mathrm{text}}_{i,j}}\) is the similarity of the text of country i’s research output to that of country j. To capture the degree of similarity, we apply a supervised topic model called labelled LDA9, which we adapt as a nation-labelled LDA (NL-LDA) by using the nationalities of the authors on papers as labels. The NL-LDA model captures the extent to which the ideas and concepts embodied by n-grams in the texts are associated with authors from particular countries. This approach is useful for disentangling what is being studied in different countries, as papers are increasingly authored by researchers from multiple countries. The Kullback–Leibler divergence (KLD)8 is then taken as the measure of similarity between countries in the text of their scientific papers. In our case, the KLD measures how much information is lost in going from the text of one country i’s scientific output to that of another country j.

The reasoning here is similar to that used in other work in the science of science12. Information loss mirrors the amount of work that scholars have to do to communicate their ideas. When very little information is lost, communication is seamless; when a lot of information is lost, communication is difficult. Note that this is not a symmetrical relationship, and that is by design. The \({\mathbf{L}}_{\mathrm{text}}^{\mathrm{T}}\) layer tends to identify the most common subject matter in national research, so when information is lost in moving from country i to country j, it indicates that researchers in country i publish on some topics that researchers in country j do not (though this is, of course, usually a matter of degree rather than of presence and absence). This means that it is harder, on average, for a scientist in country i to find a counterpart in country j who is working along a similar line of research than it is for a scientist in country j to find someone working on similar problems in country i. This also means that it is easier to find a paper from country j that cites a paper from country i than it is to find the reverse, assuming that citations are more likely when two papers share subject matter.

In principle, when the information loss is high going from i to j, we say that the similarity of i to j is low. When very little information is lost going from i to j, we say that the similarity of i to j is high. Just as in the citation layer, z-scores are used for edge weights, the only difference being that we take the negative in the text layer, as high information loss implies exactly the opposite relationship to that implied by high citation counts. So, when we compare the multiplex layer \({\mathbf{L}}_{{\mathrm{citation}}_{i,j}}\), which measures the citation flow from country i to country j, with \({\mathbf{L}}_{{\mathrm{text}}_{i,j}}\), which captures how similar country i is to country j, we use the transpose of \({\mathbf{L}}_{\mathrm{text}}\), denoted \({\mathbf{L}}_{\mathrm{text}}^{\mathrm{T}}\), where the similarity of country j to country i, given as \({\mathbf{L}}_{{\mathrm{text}}_{i,j}}^{\mathrm{T}}\), is equivalent to \({\mathbf{L}}_{{\mathrm{text}}_{j,i}}\). \({\mathbf{L}}_{\mathrm{text}}^{\mathrm{T}}\) is used in equation (1). We use this transpose because, we posit, the more researchers in country j cite researchers in country i (that is, the more attention country j gives, through its citations, to the work being done in country i), the more similar the work produced by researchers in country j ought to be to the work produced in country i (that is, the country being cited and receiving the attention). Distortions thus ought to reflect either over-recognition or under-recognition via attention (vis-à-vis citations) relative to the work being done elsewhere.

The third layer, \({\mathbf{L}}_{\mathrm{distortion}}\), is what we call the citational well (drawing on the idea of gravity wells). This layer is constructed to capture the difference between the other two layers, as given by equation (1) and Fig. 1. This means that every \({\mathbf{L}}_{{\mathrm{distortion}}_{i,j}}\) represents the distortion in the citation flow from country i to country j, relative to what we would expect on the basis of the similarity of the text written by country i’s scientists to that written by scientists in country j. Also implied is that the sum of the distortion for country j relative to every other country in the network (that is, country j’s in-degree in \({\mathbf{L}}_{\mathrm{distortion}}\)) represents the total distortion in the citation flow to country j.
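
A minimal sketch of this layer arithmetic is given below, assuming the citation counts and KLD values are already held as country-by-country NumPy arrays; equation (1) is taken here to be the element-wise difference between the z-scored citation layer and the transposed, negated-and-z-scored text layer, which is an assumption about its exact form.

```python
import numpy as np

def zscore(a: np.ndarray) -> np.ndarray:
    """z-score all edge weights of a layer (simple global standardization)."""
    return (a - a.mean()) / a.std()

def distortion_layer(citation_counts: np.ndarray, kld: np.ndarray) -> np.ndarray:
    """Sketch of equation (1): distortion as the gap between citation flow and
    (transposed) text similarity.

    `citation_counts[i, j]` holds citations from country i to country j;
    `kld[i, j]` holds KLD(c_i || c_j). Both are hypothetical inputs.
    """
    l_citation = zscore(citation_counts)
    l_text = -zscore(kld)          # negative z-score: low information loss = high similarity
    return l_citation - l_text.T   # assumed form of equation (1)
```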

To illustrate how citational lensing can be applied, we use nearly 20 million academic papers in nearly 150 fields and subfields from 1980 to 2012 in the Microsoft Academic Graph (MAG), one of the most extensive metadata repositories of academic publications. These data include metadata such as citations, along with the abstract text of published research articles. We show how citational lensing can be used to characterize changes in the international scientific hierarchy over time and how it can be scaled to cover all of science.

MAG classifies journals into various fields, which provides a fairly reliable reflection of disciplinary boundaries and allows for selection across a wide variety of fields. MAG uses a six-tiered field classification ID scheme that is human generated for the highest two levels. We primarily use the second-highest level, which offers more granular field divisions. So instead of using just ‘physics’, we consider ‘astrophysics’ and ‘nuclear physics’ to be their own fields because they have different citation practices. Fields are identified and defined for our purposes by their field IDs in MAG, and the fields are itemized in Supplementary Table 1. We classify these fields into four broad categories: (1) biomedical, behavioural and ecological sciences; (2) engineering and computational sciences; (3) physical and mathematical sciences; and (4) social sciences. We use no other sort of field normalization. The population of journals in MAG increases considerably over time, which may partly affect the representation of countries in our analysis.

\({\mathbf{L}}_{\mathrm{citation}}\) is assembled using the citation data in MAG. As mentioned above, each \({\mathbf{L}}_{{\mathrm{citation}}_{i,j}}\) holds the citation flow from country i to country j. Because citation inflation14 distorts the volume of cumulative citations in a field over time, rendering temporal comparisons biased, we standardize and ‘deflate’ the number of citations received in years t + n to the equivalent number of citations that would have been received in year t, when the paper was published. In essence, the citations received in a future year t + n are converted using an exchange rate based on the publication year t, making comparisons of citations across time less biased by volume. (In Supplementary Figs. 10–15, we rebuild our main figures comparing the citation deflation method that we use here to two other conditions: one that does not include any deflation and another that employs our own deflation method focusing specifically on countries.)
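
One simple way such a deflator might be applied is sketched below, assuming a precomputed year-to-year exchange-rate index (for example, the ratio of total citations issued in the publication year to total citations issued in the citing year); the index values and the function are illustrative assumptions rather than the exact deflation series of ref. 14.

```python
def deflate_citations(citing_year_counts: dict[int, int],
                      deflator: dict[int, float],
                      pub_year: int,
                      window: int = 5) -> float:
    """Convert citations received in years t + n into year-t equivalents.

    `citing_year_counts` maps citing year -> raw citations received that year.
    `deflator` maps citing year -> assumed exchange rate into year-`pub_year`
    citations (hypothetical index for illustration).
    """
    return sum(count * deflator[year]
               for year, count in citing_year_counts.items()
               if pub_year <= year <= pub_year + window)

# Toy usage: a paper published in 2000, cited 3, 5 and 4 times in 2000-2002
raw = {2000: 3, 2001: 5, 2002: 4}
rates = {2000: 1.00, 2001: 0.96, 2002: 0.93}   # assumed deflator values
deflated = deflate_citations(raw, rates, pub_year=2000)
```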

\({\mathbf{L}}_{\mathrm{text}}\) is constructed using text from the abstracts and titles of each paper. This has advantages over using the full texts of research papers, since some fields format papers to emphasize methods over theory or vice versa, and others might have strict length criteria in terms of word count or page length. Abstracts, however, succinctly summarize the most important concepts in a paper. We restrict our analysis to papers with English-only abstracts. (In Supplementary Figs. 22–27, we rebuild our main figures comparing these English-only abstracts to those that were subsequently translated from their original language into English by us using Google Translate.)

We build both \({\mathbf{L}}_{\mathrm{citation}}\) and \({\mathbf{L}}_{\mathrm{text}}\) using only those journals that have existed in our data since 1980, the starting point of our analyses. (In Supplementary Figs. 16–21, we rebuild our main figures including all journals irrespective of their tenure in the data.) The important terms and phrases that represent ideas, concepts and phenomena need to be efficiently extracted from the abstract texts. So, we construct each field’s corpus in year t as a combination of unigrams, bigrams and trigrams from every document’s abstract, referred to as \({\mathrm{Field}}_t\). As noted above, we use English-only abstracts for the analyses here to mitigate the risk of mistranslation; we also reconstruct our analyses after translating non-English abstracts with a Python module called googletrans, which functions as an API to Google Translate, and our conclusions are consistent with what we present here. We apply a phrase extraction algorithm called RAKE (Rapid Automatic Keyword Extraction) to each abstract to extract all important phrases and terms from unigrams through trigrams35. RAKE extracts terms and phrases from abstracts by analysing the frequency of each n-gram and its co-occurrences with other n-grams in the text. An advantage of RAKE over other approaches is that it is domain independent, so it does not rely on a pretrained corpus to identify which terms are important. We then compile an ‘academic stop word’ list of common phrases used in academic writing based on Coxhead36 and remove these phrases from the abstracts.
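
A minimal sketch of this extraction step is given below, using the rake_nltk package as one available RAKE implementation; the package choice, the toy stop-word set and the post-hoc filtering shown are illustrative assumptions, not necessarily the exact implementation used.

```python
from rake_nltk import Rake  # one available RAKE implementation (requires NLTK stopword/punkt data)

# Hypothetical ‘academic stop word’ entries in the spirit of Coxhead's list
ACADEMIC_STOPWORDS = {'significant', 'approach', 'analysis', 'furthermore'}

def extract_phrases(abstract: str) -> list[str]:
    """Extract important unigrams through trigrams from one abstract with RAKE."""
    rake = Rake(min_length=1, max_length=3)  # keep phrases of one to three words
    rake.extract_keywords_from_text(abstract)
    phrases = rake.get_ranked_phrases()
    # Drop phrases containing common academic stop words
    return [p for p in phrases
            if not any(tok in ACADEMIC_STOPWORDS for tok in p.split())]

# Toy usage: the field-year corpus is the collection of phrases over all abstracts
abstracts = ["We measure gravitational lensing of distant quasars using weak shear maps."]
field_corpus = [extract_phrases(a) for a in abstracts]
```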

KLD compares probability distributions. To process the text of scientific articles so that each country has its own probability distribution, we apply NL-LDA models to the abstracts of MAG publications to measure how similar or dissimilar the phenomena studied by researchers in different countries are2,3,4. We apply an NL-LDA model to each \({\mathrm{Field}}_t\) corpus. This approach parses out the influence of countries on multi-authored, international papers, a staple of many fields. We measure how similar individual countries’ unique national signatures (that is, how strongly the terms found in a field’s corpus in a year are associated with researchers in some country x) are to one another. The NL-LDA produces a matrix, \(\varphi_{\mathrm{Field}_t}\), where the rows are the n-grams in the corpus for \({\mathrm{Field}}_t\), defined as \(w_m\), and the columns are the national signatures, defined as \(C_n\). We standardize each national signature (column) in \(\varphi_{\mathrm{Field}_t}\) by assigning zero values to all terms that were not present in papers authored by researchers from a particular country. (Our implementation of the NL-LDA model assigns a very small non-zero value to all terms that are not present in documents with a particular nation-label but are present in \({\mathrm{Field}}_t\).) Because each national signature sums to 100%, we then renormalize it after converting the associative probabilities of absent terms to zero, so that it still sums to 100%.
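
A minimal sketch of this zero-out-and-renormalize step on an assumed term-by-country matrix \(\varphi\) (rows are n-grams, columns are national signatures) might look as follows; the presence mask and the input names are illustrative assumptions.

```python
import numpy as np

def clean_national_signatures(phi: np.ndarray, present: np.ndarray) -> np.ndarray:
    """Zero out terms absent from a country's papers, then renormalize each column.

    `phi[m, n]` is the association of n-gram m with country n from the NL-LDA;
    `present[m, n]` is True if n-gram m appears in papers authored by country n
    (both are hypothetical inputs for illustration).
    """
    cleaned = np.where(present, phi, 0.0)          # absent terms -> exactly zero
    col_sums = cleaned.sum(axis=0, keepdims=True)  # renormalize each national signature
    return cleaned / col_sums                      # columns again sum to 1 (100%)

# Toy example: 3 n-grams, 2 countries
phi = np.array([[0.5, 0.001], [0.3, 0.6], [0.2, 0.399]])
present = np.array([[True, False], [True, True], [True, True]])
signatures = clean_national_signatures(phi, present)
```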

We first validate the quality of the nation-labels produced by the NL-LDA using topic coherence scores, the standard measure of how distinct a topic is from the other topics derived from the same model. A coherent topic forms a distinctive grouping of its top n-grams that differentiates it from other topics. However, to date, no equivalent approach exists to measure nation-label coherence for a supervised model like the NL-LDA in the way that topic coherence does for unsupervised models like the LDA. This is because the number of appropriate topics extracted from an LDA is variable and somewhat subjective, whereas the NL-LDA nation-labels are nominally fixed. That said, not every country may publish enough papers in a year to produce meaningful results, so including every country in our analyses without any filtering may not be prudent. We apply the UMass topic coherence measure to the nation-labels in each NL-LDA model, comparing the document co-occurrences of each nation-label’s top 25 most strongly associated terms from its national signature. Whereas with unsupervised LDA models lower scores indicate more distinct and coherent topics, with NL-LDA models the opposite holds true: nation-labels with strong national signatures lead the way in global science and have lexical usage that is more widespread throughout the field. For each NL-LDA model, we convert these scores into percentile ranks, where the nation-labels that are the most ubiquitous (such as the United States and, in later years, China) are in the highest percentile (that is, they have lower coherence scores) and less active countries are in the lowest percentile (that is, they have higher coherence scores). For the results presented here, all of the nation-labels are included in the analyses. In Supplementary Figs. 1–7, we rerun nearly all of the figures presented here at the 25th and 75th percentile thresholds. Our results broadly hold despite the exclusion of nation-labels.
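
As a rough illustration, a self-contained version of a UMass-style coherence calculation over one nation-label’s top terms, followed by the percentile ranking, might look like the sketch below; the document representation, term ordering and toy scores are simplified assumptions, and a library implementation could be used instead.

```python
import numpy as np
from itertools import combinations

def umass_coherence(top_terms: list[str], documents: list[set[str]]) -> float:
    """UMass-style coherence for one nation-label's top terms.

    Sums log((D(w_j, w_i) + 1) / D(w_i)) over ordered pairs of top terms, where
    D counts documents containing the term(s) and w_i is the more strongly
    associated term of the pair (terms are assumed pre-sorted by association).
    """
    def doc_count(*terms):
        return sum(1 for doc in documents if all(t in doc for t in terms))

    score = 0.0
    for i, j in combinations(range(len(top_terms)), 2):   # i < j: w_i is the stronger term
        co = doc_count(top_terms[i], top_terms[j])
        single = doc_count(top_terms[i])
        if single > 0:
            score += np.log((co + 1) / single)
    return score

# Percentile ranks across nation-labels within one NL-LDA model (toy values):
# the lowest (most ubiquitous) score lands in the highest percentile.
scores = {'USA': -1.5, 'CHN': -1.2, 'PER': -9.8}
ranked = sorted(scores, key=scores.get)
percentile = {c: 100 * (len(ranked) - r) / len(ranked) for r, c in enumerate(ranked)}
```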

With these matrices, we measure how similar any country’s subject matter is to that of all other countries for some \({\mathrm{Field}}_t\). However, a standard similarity score (like cosine similarity) is not directed, and our aim is to understand how much one country looks like another when reciprocation may not happen. We compare every country to every other country in \(\varphi_{\mathrm{Field}_t}\), taking the KLD of every column in \(\varphi_{\mathrm{Field}_t}\) with respect to every other column, where each comparison is a weighted, directed link that creates an international network of asymmetric text similarity. To calculate this score, we take the two vectors for a country i and another country j, given as their national signature vectors \(\mathbf{c}_i\) and \(\mathbf{c}_j\), respectively, and determine how similar they are to each other:

$$\mathrm{KLD}(\mathbf{c}_i \parallel \mathbf{c}_j) = \sum \mathbf{c}_i \log \frac{\mathbf{c}_i}{\mathbf{c}_j}$$

(2)

Here KLD measures how much information is lost by national signature \(\mathbf{c}_i\) when it is approximated with the national signature \(\mathbf{c}_j\). In other words, the less information that is lost by approximating \(\mathbf{c}_i\) with \(\mathbf{c}_j\), the more similar \(\mathbf{c}_j\) is to \(\mathbf{c}_i\). From here, we construct 4,914 international networks of topic similarity across nearly 150 academic fields and 33 years of data (that is, 1980 to 2012), defined as \({\mathbf{KLD}}_{\mathrm{Field}_t}\) (referred to in the results as \({\mathbf{L}}_{\mathrm{text}}^{\mathrm{T}}\)).
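
A minimal sketch of this pairwise computation over the cleaned national signatures is given below; the small smoothing constant, added to avoid division by zero for terms a country never uses, is an assumption for the example.

```python
import numpy as np

def kld(c_i: np.ndarray, c_j: np.ndarray, eps: float = 1e-12) -> float:
    """Equation (2): information lost when approximating c_i with c_j."""
    p, q = c_i + eps, c_j + eps          # assumed smoothing to avoid log(0) and /0
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def kld_network(signatures: np.ndarray) -> np.ndarray:
    """Directed, weighted KLD network: entry (i, j) is KLD(c_i || c_j).

    `signatures[:, n]` is country n's cleaned national signature (a column of phi).
    """
    n_countries = signatures.shape[1]
    net = np.zeros((n_countries, n_countries))
    for i in range(n_countries):
        for j in range(n_countries):
            if i != j:
                net[i, j] = kld(signatures[:, i], signatures[:, j])
    return net
```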

We create an upper bound for KLD in the following way: for each KLD network, \({\mathbf{KLD}}_{\mathrm{Field}_t}\), we take the negative of its z-scores, so that the lowest value (that is, the lowest information loss and the most similar country dyad) is normalized to be the largest value relative to all other edge weights in the network (in terms of standard deviations). The dyad with the lowest raw KLD score is thus the dyad in which the least information is lost by approximating \(\mathbf{c}_i\) with \(\mathbf{c}_j\), so that country i is highly aligned with country j. This approach is advantageous as it renders comparisons across networks possible, particularly for extreme values.
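
For completeness, a sketch of the negative z-score transformation of one KLD network’s edge weights, assuming the off-diagonal entries are the edges of interest:

```python
import numpy as np

def negative_zscore_edges(kld_net: np.ndarray) -> np.ndarray:
    """Negate the z-scored off-diagonal KLD weights so that the most similar
    dyad (lowest information loss) receives the largest edge weight."""
    mask = ~np.eye(kld_net.shape[0], dtype=bool)   # ignore self-comparisons
    edges = kld_net[mask]
    z = (kld_net - edges.mean()) / edges.std()
    out = -z
    out[~mask] = 0.0                               # keep the diagonal at zero
    return out
```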

Statistics and reproducibility

Our analyses were observational, and no statistical method was used to predetermine sample size.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.