Research focus: The aim of this chapter is to introduce a new citation analysis and service framework based on semantic web technologies (e.g., ontology and linked data). Research methods: This research project is based on a review of the relevant literature and a series of experimental results based on ontology and linked data. Motivation: Traditional citation analysis methods and tools are overly dependent on citation databases, and traditional citation information services may ignore the semantics of knowledge resources and lack the ability to store and query data in a machine-readable mode. Findings: The findings indicate that the new citation analysis and service framework based on ontology and linked data is feasible; it can integrate information requirements and knowledge services and provide users with more personalized and comprehensive services.
- citation analysis
- methods and techniques
- linked data
- citation knowledge service system
Citation analysis is a bibliometric technique that reveals the quantitative characteristics and laws of scholarly publications. It involves the use of mathematical and statistical methods to analyze citations within journals, papers, authors, and other references. Citation analysis has seen substantial theoretical and practical progress over several decades of development and has been widely applied to evaluate scientific knowledge, identify scientific models, and explore the new frontiers being investigated by the scientific community. It is of great significance to technological innovation and scientific decision-making. Traditional citation analysis methods and tools are overly dependent on citation databases, which have the following drawbacks:
All citation acts are treated as equally important.
All kinds of statistical indicators are based on specific instances of citation, which are annotated only by the author.
Citation databases can only reveal whether there is a reference shared between different papers but fail to reflect any deeper relationships among semantic citations.
Motivations and behaviors related to citation have been analyzed by researchers from various angles. In 2014, a content-based citation analysis method was also proposed. In this chapter, we propose a new citation analysis framework based on ontology and linked data; our goal is to enhance the efficacy of citation analysis via semantic web technology.
2. Related work
Berners-Lee, Hendler, and Lassila published the article “The Semantic Web” in 2001, marking a brand new approach to semantic web research. The World Wide Web Consortium (W3C) later established a series of technical specifications that promoted the further development of the semantic web; specifications such as RDF, OWL, and SPARQL have allowed the application of the semantic web to many research fields and, further, have laid a foundation for knowledge representation, knowledge organization, and information retrieval on the Internet. Ontology is one of the backbones of the semantic web and is widely used to specify standard concept vocabularies for exchanging data between systems, offer suggestions for answering queries, publish reusable knowledge bases, and provide services that facilitate operations across heterogeneous systems and databases. In 2006, Berners-Lee first proposed the concept of “linked data”, which has since become a very popular research topic in the computer science (CS) and library and information science (LIS) fields. Linked data builds associations between objects through the resource description framework (RDF) structure, ultimately revealing the relationships and implicitly shared knowledge between heterogeneous sets of data. After more than 10 years of development, linked data has seen numerous breakthroughs in both theoretical and technical aspects. To date, the linking open data project has successfully transformed billions of web data points (e.g., Wikipedia, geographic data, government data) into the RDF triples of linked data, creating one massive data network.
In recent years, researchers have begun to introduce semantic web technology to citation analysis in an effort to exploit ontology, linked data, and other technologies to improve the description of citation behaviors and motivations. The most representative example is the semantic publishing and referencing (SPAR) ontologies created by Shotton, Portwin, Klyne, and Miles. The Citation Typing Ontology (CiTO) is the SPAR ontology used to describe the relationship between citing papers and cited papers; it provides reference information such as background, method, citation type (e.g., journals, books, reports), peer review, and more. CiTO’s citation types include factual relationships and rhetorical relationships. The current version (CiTO 2.4.6) allows authors to describe their citation motivations as references, thus helping to reveal indirect and implicit relationships at work in scholarly literature. Ciancarini et al. presented an experiment to investigate the main difficulties in using CiTO and how people understand and adopt it. Iorio et al. proposed a tool called CiTalO, which automatically annotates the nature of citations with properties defined in CiTO using semantic web and NLP techniques. By contrast, Recupero et al. created SHELDON to extract citation RDF data from text using a machine reader; CiTO was again used to describe the citation relationship.
Other researchers, for example, Ding, Konidena, Sun, and Chen, have also explored the idea of semantic citation, suggesting that individuals can use ontology and linked data to describe bibliographic data and publish it as RDF triples. Mahmood, Qadir, and Afzal combined semantic web technology with credible citation analysis to establish a framework that provides openness and reliability validation for all stages of the citation behavior lifecycle. The framework requires the use of semantic metadata at all stages of academic publishing to annotate the citation behavior and generate machine-readable RDF triples. This kind of annotation allows authors, publishers, database vendors, and citation analysis systems to work together and build a set of reliable reference information while eliminating false or misleading citation actions in the literature. More recently, Peroni et al. experimentally described references in suitable machine-readable RDF formats to make reference lists freely available to all academics. The open citation corpus was created to store citation data from open access databases.
It is difficult for researchers to move quickly into an unfamiliar field because of the mass of scientific articles that must be reviewed without prior knowledge of their contents. In a traditional citation information service, the search results are generated by matching keywords and other information supplied by the user against specific knowledge resources. Such a method is simple, but it often ignores the semantic level of the knowledge resources, causing it to miss a significant number of semantically relevant resources. It may also return a large number of studies that still do not meet the user’s personalized knowledge needs.
In 2001, Aronson  argued that query refinement based on ontology is more efficient than other methods that were available at the time. From the perspective of information organization, ontology is a new method of knowledge organization and processing, and it is also the basis of semantic webs. It can systematize and organize a large amount of relevant information. When applying ontology to information retrieval, it is necessary to apply ontological principles to the information resources, so that search reasoning is implemented by the logical rules contained in the ontology itself, and a high quality retrieval result is output. With respect to the shortcomings of traditional citation information services, the introduction of ontology may help users to improve their searches aimed at multiple citation retrieval. In 2012, Kara, Alan, Sabuncu, Akpınar, Cicekli, and Alpaslan  found that while thesauruses are concerned with meanings at the level of words, ontologies more specifically deal with meanings at the level of real-world entities denoted by words. That is, ontologies deal with the interpretation of words in terms of real-world entities.
In recent years, with the advance of ontology, ontology-based knowledge services have been developed in different areas, including personalized medicine, e-government, medicine [21, 22], smart homes, the digital library [24, 25], and so on.
The digital library is an important application area of ontology-based knowledge service research. In 2015, Patkar  indicated that ontology is one of the latest tools for information retrieval from libraries in this digital age. His paper discusses advances in information managing tools and concludes by highlighting the applications of ontology among the different fields.
Koutsomitropoulos, Solomo, and Papatheodorou  studied the semantic search service of the DSpace digital repository system. They argued that Semantic Search v2 introduces a structured query mechanism that makes querying easier and improves the design of the system, performance, and scalability. Queries based on the DSpace ontology were dynamically created, and DSpace was able to obtain structured knowledge from the available metadata. Empirical and quantitative evaluation has shown that such a system can conduct semantic searches that provide better services for inexperienced users, such as the use of new query dimensions, with clear benefits.
In 2015, Iorio and Schaerf  proposed a semantic model defined by the Sapienza Digital Library to describe resource metadata. The semantic model is derived from the metadata object description model (a digital library descriptive standard). A top-level conceptual reference model supports the implementation of semantic web technologies for digital library metadata.
3. Method and process
Any citation analysis method based on ontology and linked data mainly includes the following three steps: first, building citation ontology according to the bibliographic citation data and full-text citation information; second, using the citation ontology to normalize the reference information and publish the data to linked data according to the RDF model; and, third, in order to extract the required citation information, writing a specific SPARQL search query for a citation analysis dimension and executing the search query. The search results are then visualized to reach the citation analysis goals.
3.1. Citation ontology construction
From the perspective of citation analysis, bibliographic citation information and full-text citation information are not only two independent parts but also two important sources of data that are both necessary for citation analysis. Here, we construct the bibliographic citation ontology (BCO) and full-text citation ontology (FCO) based on the bibliographic citation information and the full-text citation information, respectively. This allows us to achieve comprehensive semantic annotation of the citation information at hand.
The most commonly used ontology construction methods are IDEF-5, the skeletal methodology, KACTUS, TOVE, METHONTOLOGY, and the seven-step methodology. The purpose of this study was to construct a task-based ontology to describe citation information, so we chose the seven-step method developed at Stanford University. The seven steps are (1) defining the domain and category of the ontology, (2) examining the possibility of reusing existing ontologies, (3) listing the important terms in the ontology, (4) defining the hierarchical system of classes, (5) defining the properties of the classes, (6) defining the facets of the properties, and (7) creating the instances. We used the widely adopted Protégé editor as our ontology development tool.
The construction of BCO is based on references. From the list of references, information such as the author, periodical, document type, year, volume, issue, and page number is extracted to form the classes of BCO. In order to extend the dimensions of citation analysis, we extend the subclasses from the perspective of journal and author. The “reference number” class is also added to the article, and the importance of the reference is measured by the quantities of internal references and external references. For property definitions, we reused existing ontology properties (e.g., “fabio:hasPublicationYear,” “bibo:volume”) and marked the newly added properties with the prefix “bco.” An example of the BCO ontology’s classes and properties is shown in Figure 1.
The construction of FCO begins with three aspects: citation function, citation sentiment, and citation position. The citation function represents the role that the cited work plays for the citing work, such as background development, data support, methodology support, extension, or refutation. Citation sentiment expresses the attitude of the citing work toward the cited work, such as positive, neutral, or negative. Citation position indicates the location of the paragraph where the reference behavior occurs, such as the “Introduction” section of the document. An example of the FCO ontology’s classes and properties is shown in Figure 2.
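The three FCO aspects can be pictured as a small record attached to each in-text citation. The following is a minimal sketch in Python, not the ontology itself: the class names, the enumeration values, and every `fco:` property name are illustrative assumptions rather than FCO's actual vocabulary.

```python
from dataclasses import dataclass
from enum import Enum


class CitationFunction(Enum):
    # Roles named in the text; the identifiers are illustrative.
    BACKGROUND = "background"
    DATA_SUPPORT = "dataSupport"
    METHODOLOGY_SUPPORT = "methodologySupport"
    EXTENSION = "extension"
    REFUTATION = "refutation"


class CitationSentiment(Enum):
    POSITIVE = "positive"
    NEUTRAL = "neutral"
    NEGATIVE = "negative"


@dataclass(frozen=True)
class FullTextCitation:
    citing: str      # identifier of the citing work
    cited: str       # identifier of the cited work
    function: CitationFunction
    sentiment: CitationSentiment
    position: str    # section where the citation occurs, e.g. "Introduction"

    def to_triples(self):
        """Return (subject, predicate, object) tuples for this annotation.

        The fco:* predicate names are hypothetical placeholders.
        """
        node = f"{self.citing}#cites-{self.cited}"
        return [
            (node, "fco:citingWork", self.citing),
            (node, "fco:citedWork", self.cited),
            (node, "fco:hasFunction", self.function.value),
            (node, "fco:hasSentiment", self.sentiment.value),
            (node, "fco:hasPosition", self.position),
        ]


annotation = FullTextCitation(
    citing="paperA",
    cited="ref4",
    function=CitationFunction.EXTENSION,
    sentiment=CitationSentiment.POSITIVE,
    position="Introduction",
)
triples = annotation.to_triples()
```

Modelling each citation act as its own node, rather than as a single triple between two papers, is one way to let function, sentiment, and position be attached to the same citation event.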
3.2. Publishing linked data of the citation information
By using the citation ontology, we can publish the citation information as linked data in the form of RDF triples. We used D2R as the linked data publishing software for this purpose. D2R is a popular tool for linked data publication that converts large volumes of relational database data into RDF triples. We then imported the linked data into the semantic repository Virtuoso.
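The idea behind such a relational-to-RDF conversion can be illustrated with a short Python sketch. This is not D2R's API or mapping language; it only shows, under stated assumptions, how one table row might become N-Triples lines. The base URI and the column names are hypothetical, while the predicate IRIs are existing Dublin Core, FaBiO, and BIBO terms.

```python
BASE = "http://example.org/article/"  # hypothetical base URI

# Column name -> predicate IRI. The predicates are existing vocabulary terms;
# the column names are assumptions about the relational schema.
COLUMN_TO_PREDICATE = {
    "title": "http://purl.org/dc/terms/title",
    "year": "http://purl.org/spar/fabio/hasPublicationYear",
    "volume": "http://purl.org/ontology/bibo/volume",
}


def escape_literal(value):
    """Escape backslashes and double quotes for an N-Triples literal.

    Simplification: newlines and other control characters are not handled.
    """
    return str(value).replace("\\", "\\\\").replace('"', '\\"')


def row_to_ntriples(row):
    """Turn one relational row (a dict with an 'id' key) into N-Triples lines."""
    subject = f"<{BASE}{row['id']}>"
    lines = []
    for column, predicate in COLUMN_TO_PREDICATE.items():
        value = row.get(column)
        if value is None:
            continue  # skip NULL columns rather than emit empty literals
        lines.append(f'{subject} <{predicate}> "{escape_literal(value)}" .')
    return lines


# Two made-up rows standing in for records of a bibliographic table.
rows = [
    {"id": 1, "title": "A study of citations", "year": 2014, "volume": 12},
    {"id": 2, "title": 'On "linked" data', "year": 2016, "volume": None},
]
ntriples = [line for row in rows for line in row_to_ntriples(row)]
```

The resulting lines could then be bulk-loaded into a triple store; the sketch emits plain string literals only and leaves datatypes and links between resources aside.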
In terms of bibliographic citation data, we use the Library, Information Science & Technology Abstracts (LISTA) database as the data source. LISTA is a citation abstract database which contains the structured data of more than 600 core journals and 5000 core authors. We have successfully published these data as linked data to form a strong foundation for subsequent citation analysis.
For the full-text citation information set, the most often-cited papers in the specific field were selected first as the citing work subset. Their references were extracted as the cited work subset. On this basis, the citing sentences in the citing papers and the corresponding passages in the cited papers were extracted, and the citation function, citation sentiment, and citation position information were marked by two trained coders. The full-text citation information was then organized into RDF triples as shown in Figure 3.
3.3. Citation analysis method implementation
The essence of the citation analysis method based on the linked data is to write the corresponding SPARQL query, which can be used to extract the citation information of specific dimensions. The search results are then calculated and visualized to analyze the citations per different dimension. In this chapter, we initially plan to implement 11 dimensions of citation analysis. Among them, citation quantity analysis, citation strength analysis, citation type analysis, citation language analysis, citation country analysis, citation age analysis, citation journal analysis, and co-citation analysis are based on the bibliographic data of traditional citation analysis, while the remaining three dimensions (citation function analysis, citation sentiment analysis, and citation position analysis) are based on a full-text citation analysis perspective.
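As an illustration of one bibliographic dimension, co-citation analysis counts how often two references appear together in the reference list of the same citing paper. The sketch below assumes the reference lists have already been retrieved (for example, by a query over the linked data); the paper and reference identifiers are made up.

```python
from collections import Counter
from itertools import combinations

# Hypothetical data: citing paper -> set of references it cites.
reference_lists = {
    "A": {"r1", "r2", "r3", "r4"},
    "B": {"r3", "r4", "r5"},
    "C": {"r3", "r4"},
}


def cocitation_counts(reference_lists):
    """Count, for every pair of references, how many papers cite both."""
    counts = Counter()
    for refs in reference_lists.values():
        # Sort so that each unordered pair has exactly one key.
        for pair in combinations(sorted(refs), 2):
            counts[pair] += 1
    return counts


counts = cocitation_counts(reference_lists)
strongest_pair, strength = counts.most_common(1)[0]
```

Here `r3` and `r4` are cited together by all three papers, so they form the strongest co-citation pair with a count of 3.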
The citation analysis process (for age and function, as examples) is shown in Figure 7. Citing papers A and B constitute the citing subset, while references [1, 2, 3, 4, 5, 6, 7] serve as the cited paper subset. The relationship between them is complex and involves many factors. As mentioned above, the citation functions between them have been marked with “cito:extends,” and the age information has been published as linked data. These citation relationships can thus be transformed into RDF triples as shown in Figure 4.
Once the triples are complete, we need to write a specific SPARQL search query to extract the specific citation information as shown in Figure 5.
The first SPARQL query is used to retrieve all the publication year information for the references cited by paper A, and the second query to retrieve all references to reference , which extends the function of document 4. The search results are then calculated and displayed as the final results. Visualization software (e.g., Power BI, Tableau) could also be applied to simplify the display of results, and other dimensions of citation analysis can be implemented according to the same principle. As the quality of data is continually improved, more dimensions of citation analysis can also be achieved in follow-up experiments.
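The two queries described above can be sketched as follows. The SPARQL text is one plausible formulation, not the chapter's own query, and the `ex:` namespace, the resource names, and the publication years are hypothetical. To keep the example self-contained, the same patterns are evaluated over an in-memory list of triples in plain Python instead of a SPARQL endpoint, and citation age is taken here as the citing year minus the cited year.

```python
# SPARQL text for the first query, kept for reference only;
# the ex: namespace and resource names are hypothetical.
QUERY_YEARS = """
PREFIX ex: <http://example.org/>
PREFIX cito: <http://purl.org/spar/cito/>
PREFIX fabio: <http://purl.org/spar/fabio/>
SELECT ?ref ?year WHERE {
  ex:paperA cito:cites ?ref .
  ?ref fabio:hasPublicationYear ?year .
}
"""

# Triples as (subject, predicate, object); all values are illustrative.
TRIPLES = [
    ("ex:paperA", "cito:cites", "ex:ref1"),
    ("ex:paperA", "cito:cites", "ex:ref4"),
    ("ex:paperA", "cito:extends", "ex:ref4"),
    ("ex:paperB", "cito:cites", "ex:ref4"),
    ("ex:paperB", "cito:extends", "ex:ref4"),
    ("ex:paperB", "cito:cites", "ex:ref6"),
    ("ex:ref1", "fabio:hasPublicationYear", 2006),
    ("ex:ref4", "fabio:hasPublicationYear", 2012),
    ("ex:ref6", "fabio:hasPublicationYear", 2015),
    ("ex:paperA", "fabio:hasPublicationYear", 2017),
]


def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def publication_years_of_references(triples, paper):
    """Mimic QUERY_YEARS: join 'paper cites ref' with 'ref has year'."""
    years = {}
    for _, _, ref in match(triples, s=paper, p="cito:cites"):
        for _, _, year in match(triples, s=ref, p="fabio:hasPublicationYear"):
            years[ref] = year
    return years


def works_extending(triples, target):
    """Mimic the second query: which works extend the target reference?"""
    return sorted(s for s, _, _ in match(triples, p="cito:extends", o=target))


years = publication_years_of_references(TRIPLES, "ex:paperA")
citing_year = match(TRIPLES, s="ex:paperA", p="fabio:hasPublicationYear")[0][2]
ages = {ref: citing_year - year for ref, year in years.items()}
extenders = works_extending(TRIPLES, "ex:ref4")
```

With this toy data, paper A cites references published in 2006 and 2012, giving citation ages of 11 and 5 years, and both A and B are returned as works that extend reference 4.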
The ontology-based citation knowledge service system introduces ontology-related theory and technology into citation knowledge organization and knowledge retrieval. The system introduces a lightweight cube ontology to organize, store, and query citation knowledge data in a machine-readable mode. It uses a domain ontology to express the semantics of the citation knowledge base and to associate the citation knowledge data with the domain knowledge. Based on user registration information, a user need survey, and the user log flow, it provides users with targeted knowledge to ensure the effectiveness of the knowledge services.
4. Framework for a citation knowledge service system
In the process of creating a citation knowledge service, we construct the citation knowledge base, the lightweight citation ontology base, and the domain ontology base by using ontology and other technologies. We use the ontology to reorganize the citation knowledge units and to organize, store, and query citation data in a machine-readable mode. By capturing behavior preferences and knowledge preferences from users’ search habits, the system is able to understand user needs and establish a matching knowledge discovery mechanism.
This chapter presents a framework of an ontology-based citation knowledge service system, which contains four core layers: a data resource layer, an ontology layer, a semantic association layer, and a functional layer, as shown in Figure 6.
4.1. Data resource layer
The data resource layer is at the bottom of the knowledge base and contains the citation knowledge base and the user database.
The citation knowledge base provides the data support for the construction of the domain ontology base, knowledge retrieval, knowledge recommendation, and other knowledge services. It stores the information about the citation resources gleaned during knowledge acquisition. In addition to providing the user with basic information, the citation knowledge base also contains other relevant information, such as the authors’ personal profiles, which reduces the need for secondary retrieval.
The user database contains the registration information of the users. The system carries out a user need survey when the user registers. It can record user preferences and extract the words or phrases entered by the user as conceptual features. It also performs ontological mapping.
4.2. Ontology layer
The ontology base contains the lightweight cube citation ontology base, the domain ontology base, and the user requirement ontology base. The lightweight citation ontology simplifies the entity level and provides a convenient, simple way to organize, store, and query data in a machine-readable mode. For example, the terms “dc:title,” “fabio:hasTranslatedTitle,” “bibo:pageStart,” and “bibo:pageEnd,” respectively, define the title of the journal, the English title, the start page, and the end page. These describe the citation in detail and realize the knowledge association of the citation information.
The domain ontology base contains the domain ontology, which includes the classes, properties, and instances of the domain ontology, as well as the ontological semantics of citation resources. Song and Zhang argue that ontology can represent the complex semantic relations in the content of information resources and that it has a well-defined hierarchical concept structure that supports logical reasoning. It therefore helps us to organize and retrieve information.
The user requirement ontology is built from user need surveys, which capture user preferences directly, and from an analysis of users’ search behavior and retrieved content. The preferences obtained in this way populate the user database and are used to build the user need ontology.
4.3. Semantic association layer
The semantic association layer mainly analyzes the content and related characteristics of the data, using Jena as the core processing tool and the pre-built domain ontology model as its basis. The information in the citation knowledge base is annotated with Jena, which is also used for reasoning. Finally, the SPARQL language is used to retrieve the annotated information. The semantic association layer is based on the user requirement ontology and the user database and addresses user requirements through scenario reasoning.
4.4. Functional layer
The functional layer provides the ontology-based citation knowledge service function, which currently includes a personal center module, a platform management module, and a knowledge service module.
The personal center provides new user registration and user data modification function. Platform management is the function of monitoring the entire knowledge service system that is used to operate the knowledge bases and databases. It mainly includes two modules: a knowledge base management module and a database management module. The knowledge service module includes the core functions of the knowledge service system, including ontology-based knowledge retrieval, knowledge navigation, and knowledge push modules.
4.5. The model for citation knowledge base
The proposed citation knowledge base is built through the following process: collecting literature from the citation database; extracting other relevant information from the authors’ home pages, organization pages, and other information carriers; introducing the lightweight cube citation ontology to extract consensus citation elements; simplifying the entity level; organizing, storing, and querying data in a machine-readable mode to produce a list of concept features; establishing the relationship between the concept feature list and the domain ontology map; associating the citation knowledge data with the domain knowledge; performing semantic processing of the information after semantic annotation, expansion, and synthesis; using ontology for formalization; mining implicit semantics through semantic reasoning; and ultimately forming the citation knowledge base. The model for the citation knowledge base is shown in Figure 7.
4.6. The model for user recommendation
This system uses the user registration information, the user need survey, and the user log flow to build the user knowledge base. The user-related knowledge is analyzed, and feature extraction is carried out. The users’ requirement ontology is constructed and mapped to the domain ontology and the users’ characteristics. Once the mapping relationship is established, the knowledge is organically related. The system then uses the user knowledge base for knowledge extraction. The knowledge resources are classified and semantic associations are created. Entities are stored in the user requirement base, and ontology-based user recommendation is ultimately achieved, as shown in Figure 8.
In actual cases, data is usually collected in an ordered sequence. The distribution of the data is not static but changes over time. As environmental factors change, the regular pattern that the data has followed also changes; this is known as a concept change. “Concept drift” means that the rules the data follows change over the course of the sequence, so that the concept drifts over time.
Because the users’ actual operation is uncontrollable and does not follow any existing model, any new factor may have an impact on the users’ operation, and concept drift in the users’ data acquisition process is inevitable; therefore, the user model requires regular evolution.
In order to reduce the effect of concept drift on prediction performance, a triggering mechanism can be used to detect concept drift. Such change detection is based on statistics. It tracks changes in the user need concept set, removing the old data and adding the newly detected data to the users’ requirement base to improve the prediction accuracy.
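A triggering mechanism of this kind can be sketched as a comparison between the data the current user model was built on and a window of recent observations. The example below is only one possible realization under explicit assumptions: user needs are reduced to a stream of concept labels, the statistic is the total variation distance between the two concept distributions, and the window size and threshold are arbitrary illustrative values.

```python
from collections import Counter, deque


def distribution(concepts):
    """Relative frequency of each concept in a sequence."""
    counts = Counter(concepts)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


class DriftDetector:
    def __init__(self, window=4, threshold=0.5):
        self.window = window
        self.threshold = threshold
        self.reference = []                 # data the current user model is built on
        self.recent = deque(maxlen=window)  # most recent observations

    def observe(self, concept):
        """Add one observed concept; return True if drift was triggered."""
        if len(self.reference) < self.window:
            self.reference.append(concept)
            return False
        self.recent.append(concept)
        if len(self.recent) < self.window:
            return False
        distance = total_variation(
            distribution(self.reference), distribution(self.recent)
        )
        if distance > self.threshold:
            # Drop the old data and rebuild the model from the recent window.
            self.reference = list(self.recent)
            self.recent.clear()
            return True
        return False


detector = DriftDetector(window=4, threshold=0.5)
# Hypothetical stream: the user's interest shifts after six queries.
stream = ["ontology"] * 6 + ["bibliometrics"] * 6
drift_points = [i for i, c in enumerate(stream) if detector.observe(c)]
```

When the trigger fires, the reference data is replaced by the recent window, which corresponds to removing old data from the requirement base and adding the newly detected data. In practice, the window size and threshold would need tuning against real user logs.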
In this chapter, we proposed a new citation analysis framework based on ontology and linked data. By combining these semantic web technologies with citation analysis, we were able to improve on the traditional citation analysis method (which relies heavily on citation databases). Rapid advancements in semantic publishing and projects like the open citation corpus have made it possible to mark massive amounts of citation information as machine-readable RDF triples. In the future, we plan to design further experiments to verify the feasibility of the proposed method. We hope that introducing ontology and linked data into citation analysis will yield optimal results while facilitating new technological developments and innovations.
An ontology-based citation knowledge service system uses ontology technology, knowledge navigation, knowledge recommendation, and other technologies and methods to organize, store, and query data in a machine-readable mode. It can search knowledge across resource types and databases. Through the semantic relevance and knowledge navigation of various resources, we can render resources more granular, standardized, and automated. Using concept drift detection to track changes in users’ needs helps the system meet their information needs, and knowledge integration services provide users with more personalized and comprehensive services. This chapter constructs a framework of an ontology-based citation knowledge service system, aiming to provide new ideas for the development of knowledge services offered by traditional citation retrieval systems. We will focus on the realization of an ontology-based citation knowledge service system in the near future.
This work was supported by a grant from the National Social Science Foundation of China (No. 16BTQ073).