Progress of Studies of Citations and PageRank

The number of citations has long been used to measure the value of a paper. Recently, however, Google's PageRank has also been extensively applied to quantify the worth of papers. In this chapter, we summarize the recent progress of studies on citations and PageRank. We also present our latest investigation of a citation network consisting of 34,666,719 articles and 591,321,826 citations. We propose the generalized beta distribution of the second kind to explain the distribution of citations and introduce a stochastic model with an aging effect and super preferential attachment. Furthermore, we clarify the positive linear relation between citations and Google's PageRank. By using this relation as a benchmark to classify papers, we extract extremely prestigious papers, popular papers, and rising papers.


Introduction
Citation analysis has a long history. Recently, Hou [1] applied a new method called reference publication year spectroscopy (RPYS) to 2543 papers, including 56,392 references, regarding citation analysis in the Science Citation Index Expanded (SCI-E) and Social Sciences Citation Index (SSCI) data from 1970 to July 2016. This investigation clarified that the development of citation analysis is divided into five periods: before 1900, 1901-1950, 1951-1970, 1971-2000, and 2001-2016. In this chapter, we focus on the distribution of citations, which was introduced by Price [2] and extensively investigated in the third period, that is, the 1950s-1970s. We consider that the number of citations expresses the popularity of papers.
The fifth period, that is, 2001-2016, is characterized by rapid expansion and diversified directions. In this period, many concepts have been introduced, for example, scientific evaluation indices, citation networks, information visualization, and citing behaviors. A variety of new impact measures have been proposed based on social network analysis in sociology and on network science, which originated from physics, mathematics, and information science. Bollen [3] summarized 39 impact measures and investigated the correlations between them by using principal component analysis. Bollen [3] then indicated that the notion of scientific impact is a multidimensional construct that cannot be adequately measured by any single indicator, although some measures are more suitable than others.
In this chapter, we focus on Google's PageRank, which was first proposed by Brin and Page [4] to obtain a list of useful web pages for users' queries. If we define the usefulness of a web page as the number of links it receives from other web pages, the search engine returns a list of portal sites, that is, popular web pages, and such a list is of little use to web users. To overcome this problem, based on the concept of voting, Brin and Page [4] defined the usefulness of a web page as the number of votes from the web pages linking to it. In the algorithm of Google's PageRank, the number of ballots a web page holds is proportional to its usefulness, that is, a useful web page has many ballots. As a result, a useful web page collects votes from other useful web pages. Thus, Google's PageRank expresses the prestige of web pages. We consider that this characteristic of Google's PageRank also holds for citation networks.
This chapter is organized as follows. In Section 2, we explain the characteristics of the dataset used in this chapter. The distribution of citations and the stochastic model of the citation network are elucidated in Section 3. In Section 4, we introduce Google's PageRank and calculate it. We consider the correlation between citations and PageRank in Section 5. Section 6 is devoted to conclusions.

Data
In this chapter, we use the Science Citation Index Expanded (SCI-E) provided by Clarivate Analytics Co., Ltd. This dataset contains bibliographic information on scientific papers published from 1900 to the present. However, owing to the authors' limited research budget, we use the part of the dataset from 1981 to 2015. This part contains 34,666,719 papers and 591,321,826 citations.
We denote the number of papers published in the year $t$ as $n(t)$. Figure 1 depicts the change of $n(t)$. In this figure, $n(t)$ increases almost monotonically from 1981 to 2013 and decreases after 2013. However, this decrease is an artifact of data collection: the dataset was compiled at the beginning of 2016 and contains only part of the papers published in 2014 and 2015, because it takes a few years for all papers to be included in SCI-E.
If we consider papers as nodes and regard citations from a citing paper to a cited paper as directed links, we can treat the citation dataset as a directed network. We call such a network the citation network. The citation network consists of many connected components. We denote the number of nodes contained in a connected component as $c$ and the frequency of $c$ as $F(c)$. Figure 2 depicts $F(c)$. We find that there is one dominant, largest connected component. It consists of 34,428,322 nodes, which are 99.3% of the papers in the dataset, and 591,177,607 links, which are 99.98% of the citations in the dataset. In the following sections, we focus on this largest connected component.
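The component decomposition described above can be sketched in a few lines. This is a minimal illustration rather than the procedure actually used for the dataset: it assumes that components are weakly connected (link direction is ignored), that papers are labeled by integers, and the function name `component_sizes` is hypothetical.

```python
from collections import Counter


def component_sizes(num_nodes, citations):
    """Return the sizes of weakly connected components, largest first.

    citations is an iterable of (citing, cited) pairs with integer node ids
    in range(num_nodes). Link direction is ignored (an assumption).
    """
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for citing, cited in citations:
        root_a, root_b = find(citing), find(cited)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    sizes = Counter(find(node) for node in range(num_nodes))
    return sorted(sizes.values(), reverse=True)


# Toy network: papers 0-3 are linked, papers 4-5 are linked, paper 6 is isolated.
toy_citations = [(1, 0), (2, 0), (3, 2), (5, 4)]
sizes = component_sizes(7, toy_citations)
frequency = Counter(sizes)  # F(c): number of components that contain c nodes
```

A union-find structure is used here because it handles hundreds of millions of links in a single pass without building adjacency lists.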

Distribution and dynamics of citations
In this section, we discuss the distribution of citations and the stochastic models that lead to the citation network.

Distribution
The number of citations of a paper is represented by the in-degree, $k$, of the corresponding node. Figure 3 is a double-logarithmic plot of the rank size distribution, $R(k)$, of citations. The right-tail part of the distribution decreases almost linearly on this plot, which means that this part follows a power law distribution, that is, $R(k) \propto k^{-\mu}$. Here, the exponent $\mu$ is called the Pareto exponent, after the Italian economist Vilfredo Pareto. The dashed line in Figure 3 is a reference line representing the power law distribution with $\mu = 2$, that is, $R(k) \propto k^{-2}$.
Pareto [5] first investigated the fat-tail behavior of the right-tail part of personal income and wealth distributions. After Pareto, many types of distribution functions have been proposed, mainly in the field of economics and especially in the investigation of personal income distribution (e.g., see [6,7]). On the other hand, in the field of scientometrics, Price [2] first applied the power law distribution to the citation network and found that the distribution of the number of references made (the out-degree in terms of network science) follows the power law distribution with $\mu = 1$ and that the distribution of the number of citations received (the in-degree in terms of network science) obeys the power law distribution with $\mu = 1.5$ or $\mu = 2$. The latter result is the same as the reference line in Figure 3.
Redner [8] investigated papers published in 1981 and cataloged by the Institute for Scientific Information (783,339 papers) and 20 years of publications in Physical Review D, vols. 11-50 (24,296 papers), and found that the right-tail part of both citation distributions follows the power law distribution with $\mu = 2$. This result is the same as that of Price [2] and the reference line in Figure 3. Albarrán and Ruiz-Castillo [10] studied 5 years (1998-2002) of publications in Web of Science (3.7 million papers) and found that the power law hypothesis for the right-tail part of the citation distribution is not rejected for 17 of the 22 scientific fields of Web of Science. Albarrán et al. [11] investigated the same dataset as Albarrán and Ruiz-Castillo [10] and found that the power law hypothesis for the right-tail part is not rejected for 140 of the 219 scientific sub-fields of Web of Science. Recently, Brzezinski [12] investigated scientific papers published between 1998 and 2002 drawn from Scopus and found that the power law hypothesis is rejected for half of the Scopus fields of science.
Although there are many studies besides those mentioned above, none has used as large a dataset as this chapter to approach the overall picture of the citation distribution. The light gray line in Figure 3 is the best fit by the generalized beta distribution of the second kind (GB2), also called the beta prime distribution (e.g., see [13,14]), whose probability density function is fitted with $a = 0.7$, $b = 15.2$, $\mu = 2.0$, and $\nu = 3.0$. Here, $B(\mu, \nu)$ is the Beta function that enters this density.
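For readers who want to experiment with GB2, the density can be evaluated as in the sketch below. It uses the common textbook parameterization with scale $b$, shape $a$, and two further shape parameters $p$ and $q$. How the chapter's $\mu$ and $\nu$ map onto $p$ and $q$ is not established here, so the function name `gb2_pdf` and its argument names are assumptions for illustration only.

```python
import math


def gb2_pdf(x, a, b, p, q):
    """GB2 density in the common (a, b, p, q) parameterization.

    f(x) = a * x**(a*p - 1) / (b**(a*p) * B(p, q) * (1 + (x/b)**a)**(p + q))
    The mapping from (p, q) to the chapter's (mu, nu) is NOT specified here.
    """
    if x <= 0:
        return 0.0
    # log B(p, q) via log-gamma for numerical stability.
    log_beta = math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q)
    z = (x / b) ** a
    # a * x**(a*p - 1) / b**(a*p) equals (a / x) * z**p.
    log_pdf = (math.log(a) - math.log(x) + p * math.log(z)
               - log_beta - (p + q) * math.log1p(z))
    return math.exp(log_pdf)
```

In this parameterization the density falls off as $x^{-aq-1}$ for large $x$, so the right tail is a power law, which is the property that makes GB2 a candidate for the whole range of the citation distribution.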

Stochastic models
Simon [15] proposed a stochastic model, the so-called Simon's model, to elucidate several empirical distributions: the distribution of words in prose samples by their frequency of occurrence, of scientists by number of papers published, of cities by population, of incomes by size, and of biological genera by number of species. Although the assumptions of Simon's model are written in terms of word frequencies, we can express them in terms of network science as follows. Assumption I: the probability that a node gets a new link is proportional to its degree, that is, the rich get richer, or the Matthew effect (e.g., see [16]). Assumption II: a new node is added with a constant probability $\gamma$. Simon's model explains the fact that the right-tail part of the distribution follows the power law distribution with $\mu = 1 / (1 - \gamma)$.
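The two assumptions can be illustrated with a short simulation. This is a minimal sketch under stated assumptions, not Simon's original formulation: each step either adds a new node with one link (probability `gamma`) or gives one more link to an existing node chosen in proportion to its degree. The function name `simulate_simon` and the single-node initial condition are hypothetical choices.

```python
import random


def simulate_simon(steps, gamma, seed=0):
    """Simulate the network version of Simon's model; return node degrees."""
    rng = random.Random(seed)
    degrees = [1]  # start from a single node with one link (assumption)
    tokens = [0]   # node id repeated once per unit of degree
    for _ in range(steps):
        if rng.random() < gamma:
            # Assumption II: a new node appears with probability gamma.
            degrees.append(1)
            tokens.append(len(degrees) - 1)
        else:
            # Assumption I: pick an existing node in proportion to its degree.
            node = rng.choice(tokens)
            degrees[node] += 1
            tokens.append(node)
    return degrees


degrees = simulate_simon(steps=10_000, gamma=0.5)
```

Drawing uniformly from `tokens` is equivalent to degree-proportional selection, because each node appears in the list once per link it holds.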
Price [17] generalized Simon's model into the so-called Price's model to explain the growth of citation networks. Barabási and Albert [18] introduced a stochastic model, the so-called BA model, based on two concepts, preferential attachment and growth, which correspond to assumptions I and II of Simon's model, respectively. The BA model is the case $\gamma = 1/2$ of Simon's model and yields the power law distribution with $\mu = 2$. Jeong et al. [19] extended the BA model to include an aging effect and a class of homogeneous connection kernels. Golosovsky and Solomon [20,21] extended it further to include the effect of initial attractivity.
Here, we use the model proposed by Jeong et al. [19] and check the aging effect and the homogeneity of the growth of the citation network. If we denote the degree of node $i$ as $k_i$, the time evolution of $k_i$ is given by

$$\frac{dk_i}{dt} = A_i(t)\, k_i^{\alpha}. \tag{2}$$

Here, $A_i(t)$ is an aging factor and $\alpha > 0$ is an unknown scaling exponent. Krapivsky et al. [22] have shown, for the case without the aging factor, that for $\alpha = 1$ (linear preferential attachment) the model is the same as the BA model and yields the power law distribution with $\mu = 2$. For $\alpha < 1$, the model yields a stretched exponential distribution, and for $\alpha > 1$ (super preferential attachment) a single node connects to nearly all other nodes, akin to gelation.
If we discretize the model and consider $\Delta t = 1$ year, Eq. (2) becomes $\Delta k_i = A_t k_i^{\alpha}$, which is linear in logarithmic form:

$$\log \Delta k_i = \alpha \log k_i + \log A_t. \tag{4}$$

The solid red line in the right panel of Figure 4 corresponds to the linear regression of the red dots by Eq. (4). The slope of this line corresponds to $\alpha$, and its intercept corresponds to $A_t$. In Figure 4, the blue, green, and magenta dots show the analysis for the years 1993, 2003, and 2010, respectively.
Figure 5 depicts the change of $\alpha$. This figure shows that $\alpha > 1$ for the entire period we investigated. From this analysis, we see that the citation network has the characteristics of super preferential attachment; therefore, a single node would be expected to connect to nearly all other nodes. However, the aging effect prevents the citation network from becoming such an oligopolistic network.
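The regression step can be sketched as follows. The sketch assumes that the discretized growth law has the form $\Delta k_i = A_t k_i^{\alpha}$, so that an ordinary least-squares fit in log-log coordinates gives $\alpha$ as the slope and $\log A_t$ as the intercept. The function name `fit_attachment_exponent` is hypothetical, and the data below are synthetic, not taken from the citation network.

```python
import math


def fit_attachment_exponent(k_values, delta_k_values):
    """Least-squares fit of log(delta_k) = alpha * log(k) + log(A).

    Returns (alpha, A). Pairs with k <= 0 or delta_k <= 0 are skipped
    because their logarithm is undefined.
    """
    points = [(math.log(k), math.log(dk))
              for k, dk in zip(k_values, delta_k_values)
              if k > 0 and dk > 0]
    n = len(points)
    if n < 2:
        raise ValueError("need at least two usable points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("all k values are identical")
    alpha = sxy / sxx                  # slope of the regression line
    log_a = mean_y - alpha * mean_x    # intercept of the regression line
    return alpha, math.exp(log_a)


# Synthetic check: data generated exactly from delta_k = 0.1 * k**1.2.
ks = [1, 2, 5, 10, 50, 100, 1000]
dks = [0.1 * k ** 1.2 for k in ks]
alpha, a_t = fit_attachment_exponent(ks, dks)
```

With real data, papers that received no new citations in the year have to be dropped or binned before taking logarithms, which is one reason the fitted exponent can depend on the preprocessing.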

Distribution of PageRank
Google's PageRank was proposed by Brin and Page [4]. The Google number, $G_i$, of paper $i$ is defined by the recursion formula (from Chen et al. [23]):

$$G_i = (1 - d) \sum_{j \to i} \frac{G_j}{k_j^{\mathrm{out}}} + \frac{d}{N}. \tag{5}$$

Here, $N = 34{,}428{,}322$ is the total number of articles contained in the largest connected component of the citation network, and $k_j^{\mathrm{out}}$ is the out-degree of node $j$, that is, the number of references made by paper $j$. The sum is over the neighboring nodes $j$ from which a link points to node $i$. In Eq. (5), $d$ is a free parameter that controls the convergence and effectiveness of the recursion calculation. In the original Google's PageRank [4], $d = 0.15$ is adopted, which is appropriate for the World Wide Web. On the other hand, $d = 0.5$ is adopted in [23], which is appropriate for citation networks.
Table 2 lists the top 20 papers by Google's PageRank. The characteristics of this list are that the papers belong to many subjects and that their publication years are relatively old.
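The recursion can be illustrated on a toy network. The sketch below assumes the form used by Chen et al. [23], $G_i = (1 - d) \sum_j G_j / k_j^{\mathrm{out}} + d/N$, where the sum runs over papers $j$ citing paper $i$ and $k_j^{\mathrm{out}}$ is the number of references of paper $j$. Papers without references are handled by spreading their weight uniformly, which is an assumption of this sketch and not something stated in the chapter. The function name `google_numbers` is hypothetical.

```python
def google_numbers(num_nodes, citations, d=0.5, tol=1e-12, max_iter=1000):
    """Iterate G_i = (1 - d) * sum_j G_j / k_out_j + d / N until convergence.

    citations holds (citing, cited) pairs; the sum runs over papers j citing i.
    Papers with no outgoing citations spread their weight uniformly
    (an assumption made here so that the G_i keep summing to 1).
    """
    out_degree = [0] * num_nodes
    for citing, _ in citations:
        out_degree[citing] += 1

    g = [1.0 / num_nodes] * num_nodes
    for _ in range(max_iter):
        dangling = sum(g[j] for j in range(num_nodes) if out_degree[j] == 0)
        base = d / num_nodes + (1 - d) * dangling / num_nodes
        new_g = [base] * num_nodes
        for citing, cited in citations:
            new_g[cited] += (1 - d) * g[citing] / out_degree[citing]
        change = sum(abs(x - y) for x, y in zip(new_g, g))
        g = new_g
        if change < tol:
            break
    return g


# Toy network: papers 1, 2, and 3 cite paper 0; paper 3 also cites paper 2.
toy = [(1, 0), (2, 0), (3, 0), (3, 2)]
scores = google_numbers(4, toy, d=0.5)
```

In the toy network, paper 0 collects votes from three papers and paper 2 from one, so their Google numbers exceed those of the uncited papers 1 and 3.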

Correlation between citation and PageRank
Bollen and Rodriguez [24] described the Institute for Scientific Information (ISI) impact factor (IF), which is defined as the mean number of citations a journal receives over a two-year period, as a metric of popularity, and Google's PageRank as a metric of prestige. A similar view was also proposed by Chen et al. [23] and Maslov and Redner [25], who investigated all publications in the Physical Review family of journals from 1893 to 2003 and found a linear relation between the Google number and the number of citations. Furthermore, [23,25] found that some outliers from this linear relation, especially papers whose PageRank ranking is remarkably high while their citation ranking is only moderately high, are universally familiar to physicists; [23,25] called such papers scientific "gems." Ma et al. [26] applied the concept of [23-25] to the field of biochemistry and molecular biology from 2000 to 2005. Whereas these studies investigated the citation networks of selected scientific fields, this chapter investigates the citation network consisting of all scientific fields.
However, there are outliers that have high prestige compared with their popularity. These papers are located above the solid gray line in Figure 8 and are regarded as extremely prestigious papers. If we denote the citation rank as $r_k$ and the rank by Google's PageRank as $r_G$, these extremely prestigious papers are extracted in the order of Google's PageRank with a constraint on the ratio $r_k / r_G$. Table 3 lists the top 20 extremely prestigious papers selected by using this constraint.
On the other hand, there are also outliers that have low prestige compared with their popularity. These articles are located below the solid gray line in Figure 8 and are regarded as extremely popular papers. They are extracted in the order of citation rank with a constraint on the ratio $r_G / r_k$. Table 4 lists the top 20 extremely popular papers selected by using the constraint $r_G / r_k > 5$.
These articles are divided into two groups. One group contains papers published in Nature, Science, and the Proceedings of the National Academy of Sciences of the United States of America (PNAS). These papers were mostly published more than 10 years ago, and their growth rate of citations, $k' / k$, is low. The other group includes papers that were mainly published in Cell and were published relatively recently. Moreover, the growth rate of citations, $k' / k$, of those papers is extremely high. Thus, we can regard these papers as rising papers.
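The rank-ratio selection described in this section can be sketched as follows. The threshold $r_G / r_k > 5$ for extremely popular papers comes from the text; the threshold used below for extremely prestigious papers (`prestige_ratio`) is a hypothetical placeholder, as are the function names and the tie-breaking rule.

```python
def rank_descending(scores):
    """Map each index to its rank (1 = largest score); ties broken by index."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def classify_outliers(citation_counts, google, popular_ratio=5.0,
                      prestige_ratio=5.0):
    """Return (prestigious, popular) lists of paper indices.

    popular: r_G / r_k > popular_ratio, ordered by citation rank.
    prestigious: r_k / r_G > prestige_ratio, ordered by PageRank rank.
    prestige_ratio is a hypothetical placeholder, not a value from the text.
    """
    r_k = rank_descending(citation_counts)
    r_g = rank_descending(google)
    papers = range(len(citation_counts))
    popular = sorted((i for i in papers if r_g[i] / r_k[i] > popular_ratio),
                     key=lambda i: r_k[i])
    prestigious = sorted((i for i in papers if r_k[i] / r_g[i] > prestige_ratio),
                         key=lambda i: r_g[i])
    return prestigious, popular


# Toy data: paper 0 is highly cited but has a low Google number;
# paper 11 is rarely cited but has the highest Google number.
toy_citations = [100, 90, 80, 70, 60, 50, 40, 30, 20, 10, 5, 1]
toy_google = [0.001, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01,
              0.005, 0.5]
prestigious, popular = classify_outliers(toy_citations, toy_google)
```

Working with rank ratios rather than raw scores keeps the selection independent of the very different scales of citation counts and Google numbers.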

Conclusions
We investigated papers published from 1981 to 2015 and contained in SCI-E. The total number of papers is 34,666,719 and that of citations is 591,321,826. We extracted the largest connected component from this dataset. The obtained citation network consists of 34,428,322 nodes (articles) and 591,177,607 links (citations).
The right-tail part of the rank size distribution of citations follows the power law distribution with exponent $\mu = 2$, that is, $R(k) \propto k^{-2}$. Furthermore, we introduced the generalized beta distribution of the second kind (GB2) as the best-fit function for the whole range of the citation distribution. We introduced a stochastic model with growth, preferential attachment, and an aging effect. Through numerical analysis, we obtained the values of its parameters.
Whereas the number of citations represents the popularity of papers, Google's PageRank reflects their prestige. We evaluated Google's PageRank for the largest connected component, which consists of 34,428,322 articles and 591,177,607 links (citations). We found that citations and Google numbers have a positive linear relation. Using this positive linear relation as a benchmark, we selected extremely prestigious and extremely popular papers. We found that most of the extremely prestigious papers belong to information science. Furthermore, we found that extremely popular papers are divided into popular papers and rising papers.
We conclude this chapter by describing two remaining issues. One concerns the stochastic model. Although we introduced GB2 as the best-fit function for the whole range of the citation distribution, there is no stochastic model that explains GB2. The other concerns the weight of links in the citation network. Almost all studies have treated citation networks as unweighted networks. However, it is possible to define link weights, for example, by the similarity between papers.