Main fronts of e-learning research worldwide, with the occurrence values in the wordcloud obtained from SCImago Journal & Country Rank (Source: authors' own elaboration).
The mapping overlay technique described in the scientific literature to analyze scientific domains must be complemented with procedures to identify and analyze the research fronts included in the cognitive structure of the represented domain. One possibility is the use of wordcloud maps to visually represent the cognitive structure of a discipline in any thematic domain, taking advantage of its capacity for abstraction and impact on the audience to stimulate new research processes. The case described in this chapter proposes an analysis of an emerging scientific discipline by using this combination of techniques (superposition and wordcloud) to explore its possibilities and limitations.
- mapping overlay
- emerging field
- SCImago Journal & Country Rank
The analysis of world scientific production helps, among many other things, to define the knowledge areas and subject categories that structure the generation of knowledge. Each classification system of scientific production defines its own areas and categories, which are largely accepted by the scientific community that consults and feeds them. Thus, Scopus classifies works into 5 thematic clusters (life sciences, physical sciences, health sciences, social sciences and humanities), 27 knowledge areas and more than 300 subject categories. Web of Science does so in 3 knowledge areas (sciences, social sciences, and arts and humanities) and 250 thematic categories.
Scopus currently has more than 70 million records and a defined set of metadata that are rigorously linked to each publication to describe its academic, social and geopolitical context. These two characteristics, a large volume of data and structured information, are the inputs for visualization techniques that generate new representations of knowledge and thus become powerful tools for the analysis of science.
Bibliometric data are very valuable for identifying the scientific publications with the greatest impact in a given discipline (e.g., Information Systems; Renewable Energy, Sustainability and Environment), for recognizing different scientific fields, and for understanding their internal dynamics and cognitive structure, either as an already consolidated research field or as an emerging discipline.
There are multiple methods and tools to visualize bibliometric information, for example distance-based, graph-based and time-based approaches. Mapping and clustering are also used to analyze the research fields of a scientific domain, the relationships between them and the evolution of the domain over time. As a tool, VOSViewer ensures a comprehensive visualization of node labels on the map. These maps, called science maps, help to locate research results, to explore collaborations and publication trends, to observe the evolution of a certain subject or discipline, and to benchmark regions, countries, institutions, authors and disciplines. However, the visualization must be able to handle large amounts of data at both small and large scales. This reduces visual search time and provides a better understanding of a complex data set. It also reveals relationships that would otherwise go unnoticed, allows a data set to be viewed simultaneously from several perspectives, aids the formulation of hypotheses and serves as an effective means of communication.
Through the overlay of science maps, bodies of research can be located visually within the sciences, which makes it possible to analyze the scientific development of well-established disciplines, trends or emerging research topics that do not fit into traditional subject categories. This is achieved thanks to the existence or construction of a stable corpus on which a smaller body can be overlaid, producing comparisons that are intuitive, easier to interpret and potentially useful in scientific analysis.
In essence, science maps are matrices of similarity measures, calculated from the correlation between items of information present in the structure of scientific communication. In other words, they show the disciplinary structure of the sciences in terms of publications. The stable or base map is constructed with bibliographic data from a database that has a definite categorization of the sciences. The analysis made from the overlay will be conditioned by the size of the data set selected for it.
In the words of Guzmán, “we can say that the analysis of information with science maps, supported by metric information studies, allows graphically representing the relationships between documents published by specific disciplines or scientific fields. These show the sub-areas of research in which the discipline has been focused over the years in order to identify, analyze and visualize the intellectual structure, as well as the temporal evolution in which the analyzed disciplines are being developed.”
Based on the above, science maps contribute to the identification of emerging disciplines by categorizing the publications that constitute their scientific communication channel.
However, the analysis of these research products is usually carried out by a very select group of specialists, since the results are not easy to understand for most of the scientific community that is interested in knowing in detail the paths and trends that their discipline is taking. Faced with this need, other visualization techniques, such as wordclouds, infographics and dashboards, have been positioned in virtual media as an alternative that gives research results greater diffusion beyond the borders of scientific communication channels. Wordclouds are mostly used to visualize a data set collected from surveys or forms. Among their advantages are: (a) they abstract towards the essential, identifying and grouping existing patterns in writing, (b) they help to provide a general sense of the text through sentiment analysis (the same visceral response does not occur when looking at a page of text), (c) they provide a quick response on possible topics of interest and research for their community, (d) the visual representation of data has an impact on the audience, stimulating more questions than answers, and (e) they allow research results to be shared in a way that does not require a deep understanding of the technicalities. Their link with bibliometric analysis can be established by considering the keywords field as the set of data collected from users (researchers) through a form (manuscript submission).
This study combines the mapping overlay technique with the visualization of terms in wordclouds to represent the research fronts of a subject, in this case the emerging discipline of e-learning. The aim is to determine whether this combination of techniques produces more intuitive, dynamic and easily accessible results for researchers and non-researchers.
2. Mapping a research field
To perform the research field mapping, we must first establish a body of documents for the bibliometric analysis, ensuring access to the bibliometric data of this set of publications. To analyze the e-learning case, we started from the methodology and findings of Tibaná-Herrera et al. for the subject categorization.
Secondly, the research fronts of the subject are identified. These reflect the consolidation of the different trends that, over time, have contributed to the development and growth of the subject in scientific communications. We propose the use of wordclouds composed of keywords to visualize the research fronts of the field, owing to their representational capacity and their rapid uptake by the community to which they are presented.
2.1. Establishment of the body of documents
We start from the premise that every research field has a set of scientific communications that contribute to the development of the subject. To identify and analyze these communications, the following steps can be followed:
Step 1. Definition of descriptors. The aim is to know all the terms with which the subject has been described in the primary scientific literature. As expected, we start from a core term, which is generally the name of the research field itself. With this term, all the publications whose title, abstract and keywords include the core term are identified in a comprehensive database.
Core term: e-learning
Data source: SCOPUS, a database that mostly indexes journals and conference proceedings.
The search results should be refined according to the desired coverage degree in the analysis and the access availability of the bibliometric data.
Publication type: Journal and Conference Proceeding
Document type: Article, conference paper and review
Analyzed timespan: 2012–2014. This period shows stable worldwide production in e-learning: production grew constantly in the preceding period and decreased significantly in the following one.
The set of publications obtained can be used in its entirety or from a statistically representative sample.
Representative sample: 2000 (21.6%)
Then, a bibliometric analysis based on keyword co-occurrence is carried out, aimed at determining the primary descriptors that are most present in the publications, their relationships and their relevance, by means of the Visualization of Similarities (VoS) technique. Additionally, possible secondary descriptors are included. These carry the same meaning as a primary descriptor and arise from linguistic variants, acronyms or abbreviations used in natural language. For example, when choosing the keywords of an article, an author may use either the e-learning or the elearning descriptor.
E-learning case
Keywords: 4521
Primary descriptors: 51. E-learning, LMS, b-learning, online learning, Moodle, m-learning, ICT, learning objects, technology acceptance model, e-learning platform, adaptive learning, e-assessment, web-based learning, virtual learning environments, adult learning, informal learning, instructional design, SCORM, augmented reality, educational technology, intelligent tutoring systems, remote laboratory, simulation, learning analytics, learning environments, e-learning 2.0, teaching and learning, interactive learning environments, educational data mining, gamification, learning design, social learning, lifelong learning, metadata, MOOC, virtual classroom, labview, learning methods, personal learning environments, adaptive e-learning systems, computer-based learning, information literacy, virtual learning, Blackboard, continuing education, game-based learning, interactive learning, personalized learning, recommender systems, virtual laboratories, virtual reality.
Secondary descriptors: 13. elearning, electronic learning, Learning management system, blearning, blended learning, mlearning, mobile learning, Information and communications technologies, eassessment, electronic assessment, VLE, Massive Open Online Courses and PLE.
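The descriptor step can be sketched in Python as follows. This is a minimal illustration and not the procedure used in the study: the `SECONDARY_TO_PRIMARY` mapping covers only a few of the descriptors listed above, the pairing of each secondary descriptor with a primary one is our assumption, the sample keyword lists are invented, and the function names are hypothetical. The sketch merges secondary descriptors into their primary form and counts how often two descriptors appear in the same publication, which is the kind of co-occurrence input that a VoS-style map lays out.

```python
from collections import Counter
from itertools import combinations

# Assumed pairing of a few secondary descriptors with their primary form.
SECONDARY_TO_PRIMARY = {
    "elearning": "e-learning",
    "electronic learning": "e-learning",
    "learning management system": "lms",
    "blended learning": "b-learning",
    "mobile learning": "m-learning",
}

def normalize(keyword):
    """Lower-case a keyword and map a secondary descriptor to its primary one."""
    key = keyword.strip().lower()
    return SECONDARY_TO_PRIMARY.get(key, key)

def cooccurrence(keyword_lists):
    """Count how many publications contain each unordered pair of descriptors."""
    pairs = Counter()
    for keywords in keyword_lists:
        # A set avoids counting a descriptor twice within one publication.
        unique = sorted({normalize(k) for k in keywords})
        pairs.update(combinations(unique, 2))
    return pairs

# Invented sample: author keywords of three publications.
sample = [
    ["elearning", "Moodle", "LMS"],
    ["E-learning", "Learning management system"],
    ["mobile learning", "e-learning"],
]
counts = cooccurrence(sample)
```

Because the pairs are built from sorted descriptor names, each pair has a single canonical key, so `counts[("e-learning", "lms")]` aggregates the first two sample publications regardless of how their authors spelled the keywords.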
Step 2. Correspondence of publications and descriptors. In a matrix that crosses all the indexed scientific publications (journals and conference proceedings) with the primary and secondary descriptors identified, each cell records the number of articles published by that journal or conference proceeding with that descriptor in the title, abstract and keywords fields. It is very important to use the same selection criteria described in the previous step to ensure information integrity. Then, the counts of the primary and secondary descriptors that refer to the same term are summed, on the assumption that the sum reflects unique publications related to each other by the descriptors.
Journals and conference proceedings included in the matrix: 12,923
Step 3. Percentage of participation in the subject (PP). This is the percentage of articles in the publication that are related to the subject during the timespan established in the initial criteria. It is calculated from the maximum number of articles per descriptor, bearing in mind that an article may be related to more than one descriptor.
Correspondence matrix description (Figure 1):
3,680 journals and conference proceedings do not have any publication related to any of the 64 descriptors.
7,801 journals and conference proceedings have a PP lower than 5%.
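Steps 2 and 3 can be illustrated with a short Python sketch. It rests on assumptions that the text does not spell out: PP is taken as the largest per-descriptor article count divided by the total number of articles that the journal or proceeding published in the timespan, and the counts for primary and secondary descriptors are assumed to be already summed. The rows of the matrix and the function name are invented for illustration.

```python
def participation_percentage(descriptor_counts, total_articles):
    """Return PP: the largest per-descriptor article count as a share of all articles."""
    if total_articles <= 0:
        raise ValueError("total_articles must be positive")
    # The maximum is used instead of the sum because one article
    # may be related to more than one descriptor.
    top = max(descriptor_counts.values(), default=0)
    return 100.0 * top / total_articles

# Invented rows of the correspondence matrix:
# source -> (article counts per descriptor, total articles in the timespan).
matrix = {
    "Journal A": ({"e-learning": 40, "lms": 12, "mooc": 5}, 80),
    "Proceedings B": ({"e-learning": 3, "m-learning": 1}, 150),
    "Journal C": ({}, 60),
}

pp = {name: participation_percentage(counts, total)
      for name, (counts, total) in matrix.items()}
```

In this invented sample, "Journal C" has no article matching any descriptor and "Proceedings B" falls below 5%, which mirrors the two groups described for the real matrix.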
Step 4. Cut-off point for the inclusion of publications in the analysis. A cut-off point on the PP must be determined, above which publications are included in the categorization of the subject. Other studies have classified publications into “pure”, “hybrid” and “unrelated” with respect to a given subject, or have focused on determining the core set of publications. However, we believe that this value should be established by combining the maximum allowed error in the subject relation of a publication with the average PP of the total set of publications. The higher the cut-off point, the greater the precision in the selection of journals, although this precision comes at the cost of a reduced volume. Conversely, a low cut-off point increases both the volume of the selection and its error. Once the cut-off point is established, all publications that exceed this threshold are considered the basic set for the analysis of the emerging subject category.
The set of publications must maintain an average PP higher than 50%, for which the cut-off point per publication was established at 25% (coinciding with the classification of pure and hybrid publications).
Eleven publications that exceeded the cut-off point were excluded because their stated scope covered other areas of knowledge.
82 journals and 137 conference proceedings that meet the criteria of the methodology were identified.
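The trade-off behind the cut-off point can be sketched in Python. The PP values below are invented and the function name is hypothetical. Only the 25% cut-off and the 50% target for the average PP come from the text.

```python
def select_by_cutoff(pp_by_publication, cutoff):
    """Keep publications whose PP exceeds the cut-off and report their mean PP."""
    selected = {name: pp for name, pp in pp_by_publication.items() if pp > cutoff}
    mean_pp = sum(selected.values()) / len(selected) if selected else 0.0
    return selected, mean_pp

# Invented PP values (in percent) for five publications.
pp_values = {"A": 90.0, "B": 60.0, "C": 30.0, "D": 20.0, "E": 4.0}

# Cut-off used in the case study: 25% per publication.
selected, mean_pp = select_by_cutoff(pp_values, cutoff=25.0)
meets_target = mean_pp > 50.0  # the set must keep an average PP above 50%

# A higher cut-off is more precise but keeps fewer publications.
stricter, stricter_mean = select_by_cutoff(pp_values, cutoff=50.0)
```

With these sample values, the 25% cut-off keeps three publications with an average PP of 60%, while a 50% cut-off keeps only two with an average of 75%. The manual exclusion of publications whose scope belongs to other areas of knowledge is a separate editorial step and is not modelled here.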
Step 5. Publication set analysis. The set of selected publications is analyzed under a bibliometric approach (a) to determine whether it represents the existence of a scientific community that communicates its knowledge through these channels and (b) to recognize it as an emerging and distinctive scientific discipline that can be defined as a transversal thematic category. For this, the mapping overlay technique can be used, which facilitates the exploration of the knowledge bases of an emerging discipline and its evolutionary dynamics. This technique requires a base map on which to overlay a local (thematic) map and thus make comparisons. This overlay places the discipline in the general topology of scientific knowledge and shows whether a cluster effect occurs, which should be considered evidence of the existence of a specific disciplinary field from the point of view of the scientific communication patterns followed by researchers.
The degree of relation between publications is established by the normalized value produced by the combination of citations, co-citations and coupling [22, 23]. In addition, this analysis can be enriched with the distribution by clusters that visualization tools such as VOSViewer perform.
The base map is a global map of science that includes the total number of publications indexed in SCOPUS, made up of 7 clusters, which in a clockwise and broad sense can be named as follows: Social Sciences (red), Psychology (light cyan), Medicine (green), Health Sciences (purple), Life Sciences (yellow), Physical Sciences (dark cyan) and Engineering and Computer Science (blue) (Figure 2).
The composite indicator was arranged by SCImago Journal & Country Rank.
The local map that is overlaid on the global map of science is the set of 219 publications selected in the previous step (Figure 3).
There is a cluster effect that shows a high cohesion among publications, which is sufficient evidence, in terms of scientific communication, that e-learning is a distinctive scientific discipline, since there is a network of relationships and interactions that are established between the authors and scientists who share thought structures, cooperation patterns, language and forms of communication.
The publications distribution shows a main group in Social Sciences and other small groups in Computer Science and Psychology.
2.2. Identification of research fronts
To identify the research fronts through the visualization of keywords in a wordcloud, it is first necessary to identify the body of publications on which the analysis will be carried out (previous section). Then, all the keywords of the publications are extracted, keeping the same filters defined in the previous stages, with the confidence of finding a set of structured and well-defined terms. This technique provides value when the data are treated in a way that ensures a correct interpretation. This is done through two tasks. The first is to refine the set of terms (which can be in the order of thousands) to obtain those that are most distinctive and that can be represented visually without loss of information. The refinement may include a minimum threshold of articles published by a journal or conference proceeding, to ensure sufficient volume and regularity in the conceptual development of the subject. It can also be refined by defining the number of terms to be displayed in the wordcloud.
Publication type: Journal and Conference Proceeding
Document type: Article, conference paper and review
Analyzed timespan: 2012–2014.
Minimum number of papers published by Journal/Conference Proceeding: 100
Number of terms to display: 100
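The refinement task can be sketched as follows. This is an illustration under assumptions: each record is taken to be one paper with its source and author keywords, the minimum number of papers is counted within the extracted set, and the sample data and function name are invented. The case study uses a threshold of 100 papers and displays 100 terms, whereas the sample uses smaller values so that it can be read at a glance.

```python
from collections import Counter

def top_terms(records, min_papers, n_terms):
    """Count keywords from sources with at least min_papers papers.

    records is a list of (source, keywords) pairs, one per paper.
    Returns the n_terms most frequent keywords with their counts.
    """
    papers_per_source = Counter(source for source, _ in records)
    counts = Counter()
    for source, keywords in records:
        if papers_per_source[source] >= min_papers:
            counts.update(k.strip().lower() for k in keywords)
    return counts.most_common(n_terms)

# Invented sample: two papers from "J1" and one from "J2".
sample_records = [
    ("J1", ["Moodle", "gamification"]),
    ("J1", ["moodle", "MOOC"]),
    ("J2", ["moodle", "virtual reality"]),
]
terms = top_terms(sample_records, min_papers=2, n_terms=2)
```

Here the single paper from "J2" is ignored because its source does not reach the minimum of two papers, so "virtual reality" never enters the count.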
The second task is to configure the variables that determine the form of the wordcloud, among which are:
Keep each term whole. It is easy to make the mistake of splitting terms. For example, the term Information and Communication Technologies should remain as one term and not be separated into 3 or 4 parts.
Do not include in the visualization terms that match the name of the scientific field analyzed, places, dates, proper names, names of organizations or any others that do not contribute to the identification of research fronts.
Define simple shapes to represent the cloud. Today there are multiple wordcloud creation tools, and most of them allow a defined image to be used for the cloud layout. It is recommended to use images without internal content, only an outline, so that the words can be distributed inside without obstacles.
Select a sans serif font. Wordclouds are most frequently presented in digital media, where clean, sharp reading is sought to avoid visual fatigue.
Define an intention for the use of color. The visual representation should be as rich as possible. Therefore, the color defined for each term must convey a characteristic of that term. A good use of color is to establish clusters of terms that determine the main research fronts.
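Some of these settings can be expressed as a small preprocessing step that runs before any wordcloud tool is used. The Python sketch below is only an illustration: the frequencies, cluster names, colors and size range are invented, and `layout_terms` is a hypothetical helper. It keeps multi-word terms whole, drops excluded terms such as the name of the field, scales the font size linearly with frequency and colors each term by its cluster. The shape and font choice are left to the rendering tool.

```python
def layout_terms(frequencies, excluded, clusters, palette, min_size=10, max_size=60):
    """Return (term, font_size, color) tuples, largest terms first.

    Terms are dictionary keys and are never split into single words.
    """
    kept = {t: f for t, f in frequencies.items() if t.lower() not in excluded}
    if not kept:
        return []
    low, high = min(kept.values()), max(kept.values())
    span = (high - low) or 1  # avoid division by zero when all counts are equal
    items = []
    for term, freq in sorted(kept.items(), key=lambda kv: -kv[1]):
        size = min_size + (max_size - min_size) * (freq - low) / span
        # Terms without a cluster fall back to a neutral color.
        color = palette.get(clusters.get(term), "gray")
        items.append((term, round(size), color))
    return items

# Invented keyword frequencies and cluster assignment.
frequencies = {
    "e-learning": 500,
    "interactive learning environments": 120,
    "teaching and learning": 90,
    "gamification": 30,
}
excluded = {"e-learning"}  # the name of the field itself is not displayed
clusters = {
    "interactive learning environments": "environments",
    "gamification": "environments",
    "teaching and learning": "strategies",
}
palette = {"environments": "blue", "strategies": "red"}

items = layout_terms(frequencies, excluded, clusters, palette)
```

Excluding the field name before scaling matters: if "e-learning" stayed in the set, its count would dominate the range and compress the size differences between the remaining terms.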
Examples of wordclouds.
Finally, by means of a rapid visual analysis of the generated wordcloud, the research fronts of the scientific field can be identified in a differentiated way.
A limitation of wordclouds that can affect the reader’s interpretation is term length: a long term placed in a central position can attract attention quickly without having significant weight. Nevertheless, this visualization technique is a powerful tool for abstracting relevant information from large volumes of data. In addition, it can be used to observe the main trends in other bibliometric data, for example, the journals and conferences with the greatest influence in the discipline or the institutions and countries that contribute the most to its productivity.
This study showed that bibliometric analysis combined with visualization techniques provides sufficient elements to map an emerging discipline, in this case study, e-learning.
The mapping overlay technique makes it possible to visualize the cohesion between the scientific communications generated by the community of researchers in the subject, to determine the knowledge areas in which the research activity takes place and to establish the base set of publications for other bibliometric analyses. Through this technique it was determined that e-learning has its scientific development mainly in the social sciences.
The visualization of the main keywords present in the set of publications of a discipline through wordclouds makes it possible to identify the research fronts of the subject clearly, by grouping the research topics and showing their relative weight in the scientific development of the discipline. In the case study, two main research fronts were identified in e-learning: interactive learning environments, and teaching and learning strategies.
We thank the SCImago Research Group for providing citation data for publications.
Conflict of interest
The data related to this research were obtained, on the one hand, from the access to SCOPUS and on the other, provided by SCImago Research Group. These are protected by licensing and copyright respectively.