Biomedical concept categories and numbers in GENIA.
The enormous and rapidly growing volume of biological literature, driven by the high rate of new publications, is one of the most common motivations for biomedical text mining. Processing this massive literature makes it possible to extract more biological information and to mine biomedical knowledge. Such information helps to understand the mechanisms of disease and promotes the development of diagnostic technology and new drugs in biomedical research. Against this background, this chapter first introduces the rise of biomedical text mining. It then describes the technology underlying biomedical text mining, namely natural language processing, and its main components. The chapter emphasizes two aspects of biomedical text mining, static biomedical information recognition and dynamic biomedical information extraction, illustrated with examples from our previous work. The aim is to give researchers a quick way to understand biomedical text mining.
- text mining
- natural language processing
- information extraction
With the rapid growth of high-throughput biological technology, biomedical science is entering the omics era. This brings several kinds of omics data, including genomics and transcriptomics, and vast amounts of biological data continue to emerge from life science research. New phenomena, discoveries, and experimental data in biomedical research are mostly published in scientific journals in electronic text form, so a large amount of biological information is scattered across all kinds of studies. Processing this literature makes it possible to extract more biological information and discover new biomedical knowledge, but manual processing is like looking for a needle in a haystack. Biomedical literature can be seen as a large unstructured data repository, which is where text mining comes into play. Text mining has emerged as a potential solution for bridging free text and structured representations of biomedical information. It processes large text collections using artificial intelligence technology, including natural language processing (NLP), machine learning (ML), and data mining. Text-mining technology is therefore a powerful tool for mining valuable information from biomedical literature. Biomedical text mining also has broader definitions: any work that extracts information from text can be considered text mining, which covers a range of tasks from static information recognition and dynamic information extraction to applications of biomedical text mining. In the following sections, we describe these aspects in detail.
2. Natural language processing
Natural language processing is a field of artificial intelligence in computer science concerned with the interaction between computers and natural languages. With the rapid growth of machine learning, much natural language processing research is now closely tied to it, and many machine-learning algorithms have been applied to natural language-processing tasks. Extracting structured data from heterogeneous narrative medical reports is a significant challenge, and one study obtained F1 scores greater than 95% using machine learning, for example, an HMM model. Weng et al. used machine learning to classify clinical notes by medical subdomain. They reported that a convolutional recurrent neural network classifier with word embeddings yielded the best performance on the iDASH and MGH datasets, with F1 scores of 0.845 and 0.870, respectively. Basaldella et al. proposed a hybrid approach using a two-stage pipeline that combines a dictionary approach with a machine-learning classifier, and they achieved an overall precision of 86% at a recall of 60% on the named entity recognition task. To help obtain biomedical knowledge, a flourishing set of ontologies attempts to represent the complexity of biomedical concepts in the text-mining area. These ontologies describe a wide variety of concepts spanning from biology to medicine. Moreover, they not only attempt to capture the meaning of a particular domain as understood by the biomedical community but are also key elements for knowledge management and data integration. Ontologies and controlled vocabularies can improve the efficiency and consistency of biomedical data curation, which has led to growing interest in developing ontologies. For example, Gene Ontology (GO) is a community-based bioinformatics resource related to gene function that represents biological knowledge in three aspects: molecular function, cellular component, and biological process.
Molecular function describes the molecular activities of gene products, such as binding or catalysis. Cellular component indicates the place where gene products are active, namely a component of a cell such as an anatomical structure. Biological process represents pathways and larger processes made up of the activities of multiple gene products. Another resource is the GENIA corpus, developed to provide reference materials that allow natural language-processing techniques to be applied to information extraction. The GENIA corpus is a semantically annotated corpus of biological literature, compiled and annotated within the GENIA project, which aims to provide high-quality reference materials for bioinformatics. Natale et al. proposed the Protein Ontology (PRO) for enhancing and scaling up the representation of protein entities, made available in OWL and through SPARQL. Ontologies provide machine-readable descriptions of biomedical concepts and link them to domain-specific vocabulary. Ontology-based mining systems attempt to map a term in text to a concept in an ontology. Kim et al. applied syntactic parsing to sentences with annotated GO concepts and exploited similarities between sentential syntactic dependencies to map text to concepts.
The general natural language processing (NLP) components include tokenization, morphological analysis, POS tagging, and syntactic parsing.
Tokenization is the process of breaking text into sentences and words. When a text enters a text-mining system, for example, a paper viewed as a continuous word stream, it is first broken up into chapters and paragraphs, and the paragraphs are then split into sentences, words, and phonemes for more sophisticated processing. During tokenization, the tokenizer can also extract token features such as the type of capitalization and the presence of digits, punctuation, and special characters.
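As a minimal sketch of the idea, the following Python code splits text into sentences and tokens and derives a few token features of the kind listed above. The splitting rules, function names, and feature names are illustrative assumptions, not the behavior of any particular tokenizer; real biomedical tokenizers handle many more cases.

```python
import re

def split_sentences(text):
    # Naive rule (assumption): a sentence ends at ., ! or ? followed by
    # whitespace and an uppercase letter.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]

def tokenize(sentence):
    # Keep hyphenated words such as "IL-2" together; every other
    # non-space symbol becomes its own token.
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*|[^\sA-Za-z0-9]", sentence)

def token_features(token):
    # Hypothetical feature names covering capitalization, digits,
    # and punctuation or special characters.
    return {
        "is_capitalized": token[:1].isupper(),
        "all_caps": token.isupper(),
        "has_digit": any(c.isdigit() for c in token),
        "has_punct": any(not c.isalnum() for c in token),
    }

text = "IL-2 activates T cells. Protein phosphatase 2A is an enzyme."
for sentence in split_sentences(text):
    print([(tok, token_features(tok)) for tok in tokenize(sentence)])
```

Keeping hyphenated strings together is a deliberate choice here, because many gene and protein names contain hyphens.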
POS tagging annotates each word with a part-of-speech (POS) tag based on its context in the text. POS tags divide words into categories according to their role in the sentence and provide information about a word's semantic content: nouns usually denote entities, whereas prepositions express relationships between entities.
Syntactic parsing performs a full syntactic analysis of sentences according to a particular grammar theory, such as constituency or dependency grammar. Constituency grammars describe the syntactic structure of sentences in terms of phrases, namely sequences of elements. Most constituency grammars contain noun phrases, verb phrases, prepositional phrases, adjective phrases, and so on. Each phrase may consist of smaller phrases or words according to the rules of the grammar. The syntactic structure of a sentence also captures the role of each phrase; for example, a noun phrase may be marked as the subject or the object of the sentence. Dependency grammars focus on the direct relations between words and do not consider constituents. Dependency analysis uses a directed acyclic graph (DAG) to denote the relations between words, with words as nodes and dependencies as edges. For example, a subject depends on the predicate verb, while an adjective depends on the noun it modifies.
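To make the dependency representation concrete, here is a small Python sketch that stores a hand-written dependency analysis as a directed graph and checks that it is acyclic. The example sentence, the relation labels, and the function names are assumptions for illustration; a real parser would produce the edges automatically.

```python
from collections import defaultdict

# (head, dependent, relation) triples for "The protein binds DNA".
# Hand-written for illustration; the relation labels are assumptions.
EDGES = [
    ("binds", "protein", "nsubj"),  # the subject depends on the predicate verb
    ("protein", "The", "det"),
    ("binds", "DNA", "obj"),
]

def build_graph(edges):
    # Adjacency list: head -> list of (dependent, relation).
    graph = defaultdict(list)
    for head, dependent, relation in edges:
        graph[head].append((dependent, relation))
    return graph

def is_acyclic(graph):
    # Depth-first search with three states; reaching a node that is
    # still "in progress" means the graph contains a cycle.
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = defaultdict(int)

    def visit(node):
        state[node] = IN_PROGRESS
        for child, _ in graph.get(node, []):
            if state[child] == IN_PROGRESS:
                return False
            if state[child] == UNSEEN and not visit(child):
                return False
        state[node] = DONE
        return True

    return all(visit(n) for n in list(graph) if state[n] == UNSEEN)

graph = build_graph(EDGES)
print(dict(graph))
print(is_acyclic(graph))
```

Words are used directly as node identifiers to keep the sketch short; a sentence with repeated words would need token positions as identifiers instead.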
3. Biomedical text mining
The enormous and rapidly growing volume of biological literature, driven by the high rate of new publications, is one of the most common motivations for biomedical text mining. The growth of the PubMed/Medline literature is reported to be exponential. It is therefore very difficult for researchers to keep up with the relevant publications in their own discipline, let alone in related disciplines.
Biomedical literature is large in scale and growing rapidly. It carries a great deal of biological information, and new phenomena, biomedical discoveries, and experimental data are often published in the most recent literature. Processing this massive literature makes it possible to extract more biological information and to mine hidden biomedical knowledge. Faced with such vast amounts of literature, even experts in a field cannot rely on manual reading to fully grasp the status quo and development trends of the research, or to obtain the information of interest for extracting biomedical knowledge. There is therefore an urgent need to apply text mining and information extraction to the biomedical literature for molecular biology and biomedical knowledge extraction. Biomedical text mining is a frontier research field that combines computational linguistics, bioinformatics, medical information science, and other research fields. It has been developing for less than 25 years and is a branch, or sub-field, of bioinformatics. Bioinformatics is defined as the application of information science and technology to understand, organize, and manage biomolecular data. It aims to provide tools and resources that help biological researchers obtain and analyze biological data, so as to discover new knowledge of the biological world. Biomedical text mining refers to the use of text-mining technology to process biomedical literature, acquire biological information, organize and manage the acquired information, and provide it to researchers. It can therefore extract various kinds of biological information, such as gene and protein information, gene expression regulation information, gene polymorphism and epigenetic information, and gene-disease relationships.
This biological information helps people to understand life phenomena and the rules of life activities. It also helps to understand the mechanisms of disease and promotes the development of diagnostic technology and new drugs in biomedical research. A large number of text-mining methods have been established to assist in the extraction of biological information. These methods vary in their degree of reliance on dictionaries, statistical and knowledge-based approaches, automatic rule generation using a part-of-speech (POS) tagger, and machine-learning algorithms, for example, Hidden Markov Models (HMMs) and decision trees. Cronin et al. classified patient portal messages and compared rule-based and machine-learning approaches using bag-of-words and natural language-processing (NLP) features. For individual communication subtypes, the best-performing classifier was a random forest on logistical-contact information, with an area under the receiver operating characteristic curve of 0.963.
In addition, several community organizations focus on the development of biomedical text-mining technology. Arising from the rapid development of the omics era, BioCreative (Critical Assessment of Information Extraction in Biology) is a community-wide effort to evaluate text mining and information extraction systems applied to the biomedical domain using natural language processing [14, 15]. Research on biomedical text mining is presented at several conferences, including the Pacific Symposium on Biocomputing, BioNLP, and Practical Applications of Computational Biology and Bioinformatics.
4. Static biomedical information recognition
In the era of systems biology, molecular biology research viewed from a system perspective involves biomedical entities such as genes, proteins, gene products, drugs, and diseases. These basic entities reflect the form in which things exist and are called static biomedical information. The biological terms that describe domain objects in the medical literature are called entities. Genes carry the essence of life information, and proteins are the executors of gene function, so identifying these entities plays an important role in revealing the phenomena of life. It is a necessary step toward further study of these biological entities and an important task in biomedical text mining. The representation of biological entities in the biomedical literature is extremely complex. Single-word entities vary in length and mix uppercase and lowercase letters, for example, urokinase, Cactus, and IkappaBalpha. Multi-word entities form phrases, such as bradykinin B(1) receptor and protein phosphatase 2A, which makes it difficult to establish entity boundaries. The same word or phrase can denote different categories of biomedical entity; for example, c-myc and IL-2 can refer to either a protein or a gene, depending on the context. Some biomedical entities also have several written forms; for example, protein phosphatase 2A, protein phosphatase, and 2A protein phosphatase all refer to the same biological entity in biomedical text. This complexity and diversity make biomedical named entity recognition a challenging research topic. Traditional recognition approaches fall into three types: dictionary-based, rule-based, and statistical machine learning.
The dictionary-based approach needs a detailed term dictionary to recognize entities in a document. Generally, the term dictionary is compiled in advance from a specific molecular biology database. The approach is therefore limited by the coverage of the term dictionary, but it has relatively high accuracy. For this reason, it is widely used in practical systems, for example, Whatizit and FRENDA.
The rule-based approach requires studying the naming features and regularities of entities in order to formulate identification rules, which demands biomedical domain knowledge from the rule developers. For example, biomedical entities are often noun phrases. In human gene nomenclature, the first character of a gene name is a letter, and the rest is a combination of letters and numbers. Tanabe and Wilbur used the successful rule-based system AbGene to produce a large, high-quality gene and protein dictionary from biomedical texts, and AbGene is also used as a component of relationship extraction.
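The naming regularity just mentioned can be written as a simple pattern. The Python sketch below encodes only that one rule (a letter followed by letters and digits); the pattern and function name are illustrative assumptions and are not taken from AbGene or any other system.

```python
import re

# One illustrative rule: the first character is a letter and the rest
# is a combination of letters and digits.
GENE_SYMBOL_RULE = re.compile(r"[A-Za-z][A-Za-z0-9]*")

def looks_like_gene_symbol(token):
    # fullmatch requires the whole token to satisfy the rule.
    return GENE_SYMBOL_RULE.fullmatch(token) is not None

for token in ["BRCA1", "TP53", "2A", "c-myc", "protein"]:
    print(token, looks_like_gene_symbol(token))
```

On its own this rule is far too permissive, since it also accepts ordinary words such as "protein", and it rejects hyphenated names such as "c-myc". Practical rule systems therefore combine many rules with context, such as POS tags and neighboring trigger words.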
The statistical machine-learning approach has developed rapidly along with the construction of biomedical corpora. An annotated corpus can be used as the training corpus for a statistical machine-learning method, and with its help entities can be identified in biomedical text. Several studies have taken this approach. ABNER, developed by Settles, used Conditional Random Fields (CRFs) as the statistical model with morphological and semantic features, and identified biomedical entities in JNLPBA with an average recall of 72.0%, precision of 69.1%, and F-score of 70.5%. Mitsumori et al. proposed an approach that uses a Support Vector Machine (SVM) as the statistical model with internal and external resource features, and showed that identification performance improves when external biological dictionary features are used. Saha et al. used a maximum entropy model combined with word-clustering features and feature selection techniques to identify biomedical entities, and achieved better performance without using domain knowledge. Li et al. used two-step CRFs to identify biomedical entities: the first CRF model identifies named entities and the second determines their types, which obtained an F-score of 74.3% on the JNLPBA corpus. Each of the three approaches has its own advantages, and hybrid approaches are also used to identify biomedical entities. Our research group has done work on identifying biomedical entities that involves the three methods above. We give several examples to illustrate static biomedical information recognition.
4.1. Dictionary-based approach to identify biomedical entities
In this section, we introduce our previous work [24, 25, 26], which was published in the International Journal of Pattern Recognition and Artificial Intelligence. We first obtain the experimental literature and build a concept dictionary based on authoritative corpora. We then use part-of-speech (POS) tagging, phrase block formulation, and the VWIA algorithm we designed to identify entities by matching them against biomedical concepts. Figure 1 describes the pipeline.
4.1.1. Obtain experimental data
The experimental data of the study come from PubMed/Medline and are downloaded automatically from the website (http://www.ncbi.nlm.nih.gov/) using the e-utilities API, which is also used in the works [27, 28, 29]. The tool behaves like a web spider that follows a series of hyperlinks, each given by a Uniform Resource Locator (URL). The URLs of the relevant biomedical literature are obtained through e-utilities, the contents are extracted by pattern mapping, and the crawled contents are stored in a local database for further processing.
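As a rough illustration of this step, the Python sketch below builds request URLs for the NCBI E-utilities `esearch` and `efetch` endpoints. The function names and the chosen parameter values are assumptions for illustration, and the sketch only constructs the URLs; it does not reproduce the download and pattern-mapping code used in the study.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(term, retmax=20):
    # ESearch returns the PubMed identifiers (PMIDs) matching a query.
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"

def efetch_url(pmids):
    # EFetch returns the records for the given PMIDs; here the
    # abstracts are requested as plain text.
    params = {
        "db": "pubmed",
        "id": ",".join(pmids),
        "rettype": "abstract",
        "retmode": "text",
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"

print(esearch_url("breast cancer", retmax=5))
print(efetch_url(["123", "456"]))  # placeholder identifiers
```

The resulting URLs could then be requested with any HTTP client and the responses stored locally, subject to NCBI's usage guidelines.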
4.1.2. Preprocess, POS tagging, and phrase block
Each collected article may contain several paragraphs, and each paragraph may include one or several sentences. Tokenization identifies the basic textual units of each sentence, and these tokens are tagged by the Stanford POS tagger. This tool reads text in a given language and assigns a part of speech (for instance, verb, adjective, or noun) to every word, using an implementation of log-linear part-of-speech taggers. Figure 2 shows the POS tagging result for an example sentence. Given the characteristics of biomedical concepts, the POS-word pairs are important for identifying them. For example, nouns, adjectives, and participles can be components of biomedical concepts, while other words, such as indefinite articles and verbs, are omitted from concept identification. We therefore extract phrase blocks made up of nouns, adjectives, and participles. These phrase blocks are then normalized to a unified base form. For example, words tagged “NN” (singular noun) and “NNS” (plural noun) are both transformed to the singular form.
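The phrase block step can be sketched as follows in Python, assuming the sentence has already been tagged with Penn Treebank style tags. The tag set chosen for nouns, adjectives, and participles, the crude plural rule, and the function names are assumptions for illustration and may differ from the original implementation.

```python
# Penn Treebank tags treated as possible components of a concept:
# nouns, adjectives, and participles (an assumed selection).
PHRASE_TAGS = {"NN", "NNS", "NNP", "NNPS", "JJ", "JJR", "JJS", "VBG", "VBN"}

def normalize(word, tag):
    # Crude singularization for plural nouns; a real system would use
    # a morphological analyzer or lemmatizer instead.
    if tag in {"NNS", "NNPS"} and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def phrase_blocks(tagged_sentence):
    # Collect maximal runs of consecutive tokens whose tags are in PHRASE_TAGS.
    blocks, current = [], []
    for word, tag in tagged_sentence:
        if tag in PHRASE_TAGS:
            current.append(normalize(word, tag))
        elif current:
            blocks.append(" ".join(current))
            current = []
    if current:
        blocks.append(" ".join(current))
    return blocks

# Hand-tagged example sentence (the tags are assumptions, not tagger output).
tagged = [("The", "DT"), ("activated", "VBN"), ("T", "NN"), ("cells", "NNS"),
          ("express", "VBP"), ("the", "DT"), ("IL-2", "NN"), ("gene", "NN"),
          (".", ".")]
print(phrase_blocks(tagged))  # ['activated T cell', 'IL-2 gene']
```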
4.1.3. Biomedical dictionaries’ construction
Because the dictionary-based approach has high precision, we use it to identify biomedical concepts. The concept dictionary is built from authoritative biomedical ontologies, for example, Disease Ontology (DO), Gene Ontology (GO), and the GENIA ontology. The GENIA corpus provides reference material for biomedical text mining. In version 3.02, it is a semantically annotated corpus of 2000 MEDLINE abstracts, with almost 100,000 annotations of biomedical terms, more than 507,325 words, and 18,546 sentences. Each article in the corpus is encoded in an XML-based mark-up scheme with ID, title, and abstract. Both titles and abstracts are marked up with biologically and semantically meaningful annotated terms. The corpus annotates biological terms with the terminal concepts of the GENIA ontology. For instance, in <cons lex="IL-2_gene" sem="G#DNA_domain_or_region">IL-2 gene</cons>, the attribute “lex” gives the concept, and the attribute “sem” gives the type of the concept. Because of this structured scheme, the annotated terms can be extracted with regular expressions. The extracted biologically meaningful terms form the concept dictionary for biomedical text mining. For example, the concept taxonomy of GENIA version 3.02 contains the biologically relevant nominal categories shown in Table 1.
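A minimal Python sketch of such regular-expression extraction is shown below, assuming annotations of the form given in the example above. The pattern, the function name, and the dictionary layout are assumptions for illustration; the pattern does not handle nested annotations, for which an XML parser would be more robust.

```python
import re

# Matches <cons lex="..." sem="...">surface text</cons> with optional
# whitespace around "=". Nested <cons> elements are not handled.
CONS_PATTERN = re.compile(
    r'<cons\s+lex\s*=\s*"([^"]*)"\s+sem\s*=\s*"([^"]*)"\s*>(.*?)</cons>',
    re.S,
)

def extract_concepts(xml_text):
    # Map each concept (lex) to its type (sem) and surface form.
    dictionary = {}
    for lex, sem, surface in CONS_PATTERN.findall(xml_text):
        dictionary[lex] = {"type": sem, "surface": surface.strip()}
    return dictionary

sample = ('<sentence>Expression of the <cons lex="IL-2_gene" '
          'sem="G#DNA_domain_or_region">IL-2 gene</cons> was induced.</sentence>')
print(extract_concepts(sample))
```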
The biomedical text of each article is divided into sentences, and each sentence is parsed into phrase blocks. We developed an algorithm named the Variable-step Window Identification Algorithm (VWIA) to identify biomedical concepts in these phrase blocks. The approach obtained an overall F-measure of 95.0% on the GENIA corpus. The implementation of the approach is shown in Figure 3.
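The details of VWIA are given in Figure 3 and are not reproduced here. As a generic illustration of window-based dictionary matching, and not the authors' algorithm, the Python sketch below tries the longest window of tokens first and shrinks it until a dictionary entry is found. The function name, the maximum window size, and the toy dictionary are assumptions.

```python
def match_concepts(tokens, dictionary, max_window=5):
    # Greedy longest-match lookup: at each position, try the widest
    # window first and shrink it step by step until a term is found.
    matches, i = [], 0
    while i < len(tokens):
        found = None
        for size in range(min(max_window, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + size]).lower()
            if candidate in dictionary:
                found = (candidate, i, i + size)
                break
        if found:
            matches.append(found)
            i = found[2]  # continue after the matched term
        else:
            i += 1        # no term starts here; move one token on
    return matches

# Toy dictionary with lowercase entries (an assumption for this sketch).
concepts = {"protein phosphatase 2a", "protein phosphatase", "il-2 gene"}
tokens = ["protein", "phosphatase", "2A", "regulates", "IL-2", "gene"]
print(match_concepts(tokens, concepts))
# [('protein phosphatase 2a', 0, 3), ('il-2 gene', 4, 6)]
```

Preferring the longest match means that "protein phosphatase 2A" is reported as one concept rather than as the shorter dictionary entry "protein phosphatase".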
4.2. Machine-learning approach
The pipeline of our system mainly contains four modules: preprocessing module, training module, tagging module, and testing module.
4.2.1. Preprocessing module
When machine-learning methods are applied, the original biomedical texts cannot be processed directly by the CRF model; they must first be preprocessed into the required format. The original training, testing, and prediction texts are converted into files of the specified format that contain the feature dataset.
4.2.2. Training module
The training texts are passed through the preprocessing module, which selects and extracts features and produces training files in the specified format. These files are used to train the model under a given set of parameters, which yields the trained model files. The input of the training module is the preprocessed training texts, and its output is the model files, including the set of feature functions and the weight parameters.
4.2.3. Tagging module
After the preprocessing module has processed the test texts, the resulting test files are passed to the tagging module together with the model files to obtain the tagged files. The input of the tagging module is the preprocessed test texts, and its output is the tagged result files.
4.2.4. Testing module
The testing module measures the performance of our system. After the test texts have been processed by the preprocessing module, the gold-standard test files are passed to the testing module together with the tagging result files. The input of the testing module is the preprocessed test files and the tagging results, and its output is the identification performance of our system, including precision, recall, and F-measure.
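The three measures can be computed as in the following Python sketch, which compares predicted entities with gold-standard entities by exact match. Representing an entity as a (start, end, type) tuple and the function name are assumptions for illustration.

```python
def evaluate(gold, predicted):
    # Entities are compared by exact match of (start, end, type).
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

gold = [(0, 3, "protein"), (4, 6, "DNA"), (8, 9, "cell_type")]
predicted = [(0, 3, "protein"), (4, 6, "protein")]
print(evaluate(gold, predicted))
```

In this toy example one of the two predictions is correct and one of the three gold entities is found, so precision is 0.5, recall is 1/3, and the F-measure is 0.4.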
The approach uses features including POS features and word surface clue features (uppercase, lowercase, numbers, special characters, initial). The system was trained and tested on GENIA corpus version 3.02 with 10-fold cross-validation. Table 2 shows the system's performance on the six classes. Following this method, we developed a biomedical entity recognition system called BerMiner, implemented in Java on Linux. Figure 5 shows the biomedical entities identified by the system.
| Project | Precision (%) | Recall (%) | F-measure (%) |
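The POS and word surface clue features used by the system can be illustrated with the short Python sketch below, which turns a tagged sentence into one tab-separated row per token with the label in the last column. The feature names, the reading of "initial" as an initial capital letter, the column order, and the B-/I- label scheme are assumptions for illustration, not the exact file format of the system.

```python
FEATURE_ORDER = ("has_upper", "has_lower", "has_digit", "has_special", "initial_upper")

def surface_features(word):
    # Word surface clues: uppercase, lowercase, numbers, special
    # characters, and an initial capital letter (assumed reading of "initial").
    return {
        "has_upper": any(c.isupper() for c in word),
        "has_lower": any(c.islower() for c in word),
        "has_digit": any(c.isdigit() for c in word),
        "has_special": any(not c.isalnum() for c in word),
        "initial_upper": word[:1].isupper(),
    }

def to_feature_rows(tagged_sentence, labels):
    # One row per token: word, POS tag, binary surface features, label.
    rows = []
    for (word, pos), label in zip(tagged_sentence, labels):
        feats = surface_features(word)
        columns = [word, pos] + [str(int(feats[k])) for k in FEATURE_ORDER] + [label]
        rows.append("\t".join(columns))
    return rows

sentence = [("IL-2", "NN"), ("gene", "NN")]
for row in to_feature_rows(sentence, ["B-DNA", "I-DNA"]):
    print(row)
```

A one-token-per-line layout with the label last is a common input convention for sequence labeling toolkits, which is why it is used in this sketch.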
5. Extraction of dynamic biomedical information
Biomedical entities interact with each other in the process of genetic information transfer and expression, giving rise to gene-gene interactions, relationships between genes and diseases, relationships between genes and gene products, molecular signal transduction pathways, and so on. This information is dynamic in form: dynamic biomedical information represents the activities of biomedical entities. Extracting dynamic biomedical information means extracting the associations between biomedical entities, which is often done by entity co-occurrence analysis based on statistical theory. Glenisson et al. proposed an approach that extracts the relationships between genes using the vector space method and the k-medoids algorithm for gene clustering. Wren explored a method for measuring the relationships between biological entities using a mutual information model. Wu et al. studied the interactions between genes and drugs with text-mining technology in three steps: first, gene and drug entities are identified in Medline abstracts; second, gene-2-drug pairs are extracted at different levels; third, the gene-2-drug pairs are ranked with a mathematical statistical model. Our research group has done similar work [31, 32, 33]. Here, we use the works [31, 33] as examples to describe the extraction of dynamic biomedical information in detail.
5.1. Relationship extraction based on statistic model
Dynamic information represents the activities of biomedical entities. In this study, we focus on the dynamic processes of biomedical entities and extract dynamic biomedical information, namely the associations between biomedical entities, using entity co-occurrence analysis grounded in statistical theory for better precision. Entity co-occurrence analysis assumes that if two entities occur together at a certain level of a paper (e.g., the full text, a paragraph, a sentence, or a phrase), then the two entities may be related. Different levels imply different strengths of association. According to syntactic analysis of biomedical texts, the association weight is highest at the phrase level, and the sentence level is weighted higher than the full text. Because this study uses multiclass biomedical entities, the extracted associations are also of multiple types. We build a data model of entity association based on entity co-occurrence analysis with statistics. The data model can be formally represented as a three-tuple, as shown in Eq. (1).
Suppose that each entity is assigned an entity category.
An association category is defined for each pair of entity categories. For instance, it could represent the association between gene and disease, or the association between gene and microRNA, and so on. A correlation factor between two entities is then defined as shown in Eq. (3).
After building the data model of entity association, we consider how to extract this dynamic biomedical information from biomedical literature. Starting from the identified entities, we design an algorithm for Mining Multiclass Entity Association (MMEA) under the data model, based on co-occurrence statistical analysis, as shown in Figure 6.
The input of the MMEA algorithm is the set of entities identified in the entity recognition step. The algorithm first gets the category of each entity (lines 4–6). For a given entity, the algorithm decides the types of associations between it and other entities (line 7) and computes the correlation factor by Eq. (3) (line 8). The results are stored in four-tuples for further processing (line 9). The process proceeds with the increment of
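As a rough illustration of level-weighted co-occurrence analysis, the Python sketch below accumulates a score for each pair of co-occurring entities and records the association type from the two entity categories. It is not the MMEA algorithm and does not implement Eq. (3): the numeric weights, the scoring by simple summation, and all names are hypothetical. Only the ordering of the levels (phrase above sentence above full text) follows the text.

```python
from collections import Counter
from itertools import combinations

# Hypothetical weights; only their ordering is taken from the text.
LEVEL_WEIGHTS = {"phrase": 3.0, "sentence": 2.0, "document": 1.0}

def association_scores(units):
    # units: iterable of (level, [(entity, category), ...]) describing
    # which entities co-occur in a phrase, sentence, or full text.
    scores = Counter()
    for level, entities in units:
        weight = LEVEL_WEIGHTS[level]
        for (ent_a, cat_a), (ent_b, cat_b) in combinations(sorted(set(entities)), 2):
            # Association type from the two categories, e.g. "disease-gene".
            association_type = "-".join(sorted((cat_a, cat_b)))
            scores[(ent_a, ent_b, association_type)] += weight
    return scores

units = [
    ("sentence", [("BRCA1", "gene"), ("breast cancer", "disease")]),
    ("phrase", [("BRCA1", "gene"), ("breast cancer", "disease")]),
    ("document", [("BRCA1", "gene"), ("miR-21", "microRNA")]),
]
for key, score in association_scores(units).items():
    print(key, score)
```

Each result combines two entities, an association type, and a score, which loosely mirrors the four-tuples described above.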
5.2. Relationship visualization
Biomedical text mining treats user interactivity as central, so it needs to give users ways to interact with the resulting data. Text-mining visualization supports this interactivity with graphical approaches. For example, we developed a circular network graph that allows disease researchers to explore the relationships between genes related to breast cancer. To capture the susceptibility genes related to breast cancer, our research group also designed a fan-like network visualization. The nodes represent biomedical entities, and the links indicate associations between entities (as shown in Figures 7 and 8).
This chapter first introduces the rise of biomedical text mining. It then describes the technology underlying biomedical text mining, namely natural language processing, and its main components. Finally, it emphasizes two aspects of biomedical text mining, static biomedical information recognition and dynamic biomedical information extraction, illustrated with examples from our previous work.
This research is supported by the National Natural Science Foundation of China (Grant Nos: 61502243, 61502247, 61272084, 61300240, 61572263, 61502251, and 61503195), Natural Science Foundation of the Jiangsu Province (Grant Nos: BK20130417, BK20150863, BK20140895, and BK20140875), China Postdoctoral Science Foundation (Grant No. 2016M590483), Jiangsu Province postdoctoral Science Foundation (Grant No. 1501072B), Scientific and Technological Support Project (Society) of Jiangsu Province (Grant No. BE2016776), Nanjing University of Posts and Telecommunications’ Science Foundation (Grant Nos: NY214068 and NY213088). This work is also supported in part by Zhejiang Engineering Research Center of Intelligent Medicine (2016E10011).