Open access

Information Extraction Approach for Clinical Practice Guidelines Representation in a Medical Decision Support System

Written By

Fernando Pech-May, Ivan Lopez-Arevalo and Victor J. Sosa-Sosa

Published: 06 September 2011

DOI: 10.5772/intechopen.84026

1. Introduction

Errors in healthcare are a leading cause of death and injury. Kohn et al. (Kohn et al., 2000) report, for example, that preventable adverse events are a leading cause of death in the United States: at least 44,000 and perhaps as many as 98,000 Americans die in hospitals each year as a result of medical errors. Other countries face similar scenarios. This situation has motivated the use of Clinical Practice Guidelines (CPGs) to reduce the uncertainty of clinical professionals (nurses and physicians) when making decisions about a patient's illness.

Clinical Practice Guidelines (CPGs) are documents containing guidelines and structured recommendations that are defined by domain experts based on medical and scientific evidence (Teije et al., 2006; Twaddle, 2005). Thus, CPGs provide clinical professionals with guidance and scientific evidence for making flexible recommendations about specific health circumstances (Field & Lohr, 1990).

The main objective of CPGs is to offer clinical staff a set of recommendations that help in the diagnosis, prognosis, and treatment of specific illnesses. The goal is to improve the medical attention given to patients. Furthermore, a CPG is an important aid for patients and their families in understanding the efficiency of a treatment, and an important tool for improving the quality of medical care. Because CPGs are largely documents in narrative form, they are sometimes ambiguous and lack a defined structure and internal consistency, which makes them too complicated to be understood directly by a computer. The information usually contained in a CPG consists of plain text, lists, diagrams, tables, and annotations in HTML, XHTML, or PDF format.

To make this information understandable for a computer, formal CPG representation languages are required (Clercq et al., 2004; Votruba et al., 2004). In this sense, many researchers have proposed frameworks, approaches, representation languages, and tools for modelling CPGs so that they can be interpreted by computers (Hripcsak et al., 2005; Isern & Moreno, 2008). Some of these tools and approaches are oriented towards a specific CPG representation; others are intended for more general use with several representation languages. However, the formalization process is still carried out manually. Although several formal languages exist, using them to formalize CPGs manually is complex and time-consuming. Using tools for the formalization of CPGs requires knowledge not only of formal methods but also of the medical domain.

This paper describes a basic Information Extraction (IE) approach to enhance knowledge acquisition from Clinical Practice Guidelines. The aim is to support the CPG modeller during the formalization process, facilitating the interpretation of CPGs by computers; such support can become an important module of a Medical Decision Support System. The output of this approach can also help non-clinical medical staff to understand CPGs better. This work was motivated by a preliminary effort in which a CPG interpreter was developed (Pech-May, 2010).

The paper is organised as follows. Section 2 gives background on Clinical Practice Guidelines and Information Extraction. Section 3 describes the proposed Information Extraction approach. Section 4 presents the experiments carried out and the results obtained with a first prototype of the proposed approach. Finally, Section 5 presents some conclusions, remarks, and further work.


2. Background

2.1. Clinical Practice Guidelines

For interpretation purposes, the medical procedures within a CPG are translated into algorithms that describe such procedures for diagnosis, prognosis, and treatment. The representation of a CPG as algorithms allows the relevant information to be organized in a directly applicable manner. In consequence, this representation can enhance and support the decision-making process (Patel et al., 2001; Lyng et al., 2008).

Several languages for CPG representation have been developed for different purposes, users, and applications. Shiffman et al. (Shiffman et al., 2000) described the requirements for modelling the knowledge of CPGs, taking into account issues such as completeness, expressivity, usability, and reuse.

Most of the representation languages use XML as their machine-readable format. Some of the most widely used languages for representing CPGs are:

  • Asbru (Young et al., 2007) is a task-specific and intention-based plan representation language. It was designed specifically for a set of management-task plans. Some tools that help the formalization of CPGs in Asbru are AsbruView and DELT/A.


  • GLIF (Wang et al., 2004) (the Guideline Interchange Format) defines an ontology for the representation of CPGs, as well as a medical ontology for representing medical data and concepts. GLIF (in its third version) includes a formal expression language for specifying decision criteria and patient state. A tool that allows modeling CPGs in GLIF is Protégé.


  • GEM (Ciccarese et al., 2004) (the Guideline Elements Model) is an XML-based guideline document model that can store and organize the heterogeneous information contained in practice guideline documents. A tool that supports the formalization of CPGs in GEM is GEM Cutter.


  • EON (Tu et al., 2001) is a guideline modeling and execution system that is part of the EON architecture, a component-based suite of models and software components for the creation of guideline-based applications.

  • PROforma (Sutton and Fox, 2003) allows a guideline to be modeled as a set of tasks and data items. It is designed to support the management of medical procedures and clinical decision making at the point of care. The PROforma task model divides a generic task (keystone) into four types: plans, decisions, actions, and enquiries.

The following list includes some notable aspects that are considered in the most typical formal languages, such as Asbru, PROforma, and GLIF.

  • Organization of plans: Asbru and PROforma both use a single generic object class for modelling plans: the plan object. GLIF uses two types of plans: guides and macros. Guides cover the direction and flow control decisions. Macros specify, in a declarative way, procedure patterns for specific purposes by means of a set of implementation steps, which appear as a single block in the CPG.

  • Specification of goals/intentions: GLIF specifies goals as strings; Asbru represents the intentions of plans as temporal patterns depending on the context.

  • Action model: Actions are the modelling primitives used to represent tasks in a CPG (for instance, prescription or clinical research). All the languages allow medical actions to be specified, but only GLIF has special structured classes for doing so. This modeling method provides an efficient mechanism to map instances of medical actions to terms of a restricted vocabulary. Regarding the effects of actions, Asbru and PROforma, unlike GLIF, can express effects, which makes it possible to reason about actions based on their effects. In Asbru, the effects of a plan can be used to select between alternative plans and to express causal relations. In PROforma, in turn, the effects of actions are modelled as postconditions, which are semantically different from the effects in Asbru because they represent assertions that hold when an action is completed.

  • Representation of medical knowledge and patient data: PROforma models medical knowledge by means of relations between concepts (indications, contraindications, interactions between drugs, etc.). Such relations are included as arguments in alternative decisions. In GLIF, medical knowledge is represented as instances of concept-relation. Asbru has no explicit representation for this kind of knowledge as part of the CPG model; nevertheless, this knowledge can be accessed by means of function calls.

The use of any of the formal languages involves several additional tasks that depend on the particular language. To tackle this issue, several assistant tools have been developed to support the formalization process. They range from markup-based tools, such as DELT/A, Stepper, and GEM Cutter, to graphical tools that use symbols to model diagrams, such as Protégé or the plan body wizard of the DeGel framework. A brief description of these tools is given below.

  • Stepper (Růzicka and Svatek, 2004) is a markup tool for the formalization of narrative CPGs. The formalization of a CPG is done through user-defined stages, and each stage is transformed to XML.

  • GEM Cutter (Karras et al., 2000) transforms CPGs into GEM format. The GEM Cutter interface shows the textual CPG and its XML representation, thereby facilitating the user interaction in the transformation of the CPG.

  • DELT/A (Votruba et al., 2004) supports the translation of HTML documents to XML. DELT/A provides two main features: (1) linking between a textual guideline and its formal representation, and (2) applying design patterns in the form of macros.

  • Uruz is part of the DeGel framework (Shalom et al., 2003). It uses a markup mechanism that allows the user to introduce medical terms in the CPG. Such terms can come from a vocabulary such as ICD-9-CM (ICD-9-CM, 2010), the International Classification of Diseases, which is a classification of diseases and procedures used in the coding of clinical information derived from medical assistance, mainly in the hospital environment and specialized medical care centers.


  • Protégé (Gennari et al., 2002) is a general purpose tool for knowledge acquisition. It is broadly used in several knowledge domains. This tool allows CPGs to be modelled in different formal representation languages.

  • AsbruView (Kosara et al., 2002) is a graphical user interface for Asbru to support the development of CPGs and medical protocols. AsbruView is focused on visualising data and plans during the design and execution.

According to the specialized literature, important work has been done on translating CPG documents into a machine-readable representation (Pech-May, 2010). Dart et al. (Dart et al., 2001) proposed a generic model to represent any CPG in XML format and showed that CPGs can be modeled in a generic XML file. Bosse (Bosse, 2001) developed, for one CPG, an interpreter capable of simulating CPGs written in the Asbru language. Geldof (Geldof, 2002) presented a methodology to formalize CPGs in several languages, from understanding the guideline to encoding it in XML. Aguirre-Junco et al. (Aguirre-Junco et al., 2004) described a knowledge specification method based on a structured and systematic analysis of text, allowing a detailed specification of a decision tree for CPGs. Fuchsberger and Miksch (Fuchsberger & Miksch, 2002) presented an execution unit tailored for a particular CPG representation in Asbru plans.

The main aim of all the above work is reasoning with the extracted medical knowledge. A reasoning process over such knowledge is desirable for nurses and physicians. Following this trend, some efforts have been made to develop Medical Decision Support Systems (Kaiser & Miksch, 2005). However, as in similar work on Decision Support Systems, knowledge acquisition becomes a bottleneck, which is the main limitation of this kind of system. In this sense, we introduce an approach to extract knowledge from textual CPGs that integrates an innovative Information Extraction module to facilitate knowledge acquisition.

2.2. Information Extraction

Information Extraction (IE) is responsible for structuring the information contained in plain texts that is relevant for a particular domain (called the extraction domain) (Karras et al., 2000; Lehnert et al., 1994). IE is a research subject that covers many areas. The goal of an IE system is to find and link relevant information while ignoring extraneous and irrelevant information. Peshkin and Pfeffer (Peshkin & Pfeffer, 2003) define Information Extraction as the task of filling template information from previously unseen text that belongs to a pre-defined domain.

One of the main reasons to use IE is its role in the evaluation and comparison of different Natural Language Processing technologies in domains highly influenced by human interactions, like the medical domain.

IE systems can be classified according to two approaches:

  • Knowledge Engineering (KE): This is focused on an empirical method, or one based on a domain corpus, to develop efficient and robust Natural Language Processing systems (Kasabov, 2006).

  • Machine Learning (ML): This starts from a known set of documents and outputs and uses a set of patterns to extract knowledge by means of Machine Learning techniques (Ethem, 2004).

Based on the ML approach, IE can be seen as a useful technique to extract information from Clinical Practice Guidelines (CPGs) with the aim of supporting their formalization. In particular, one of the tools used in the medical domain is the Badger system (Soderland et al., 1995), a text analysis tool that summarizes medical patient records by extracting diagnoses, symptoms, physical findings, test results, and therapeutic treatments based on linguistic concepts.

There are other IE systems based on Machine Learning techniques, such as SRV (Sequence Rules with Validation) (Freitag, 1998), which transforms the pattern learning problem into a classification problem; RAPIER (Califf, 1998), which uses pairs of documents and filled templates; and WHISK (Soderland, 1999), which learns rules to extract information from a range of text styles. These tools can be adapted to several domains.

Some authors consider Information Extraction (IE) a later stage in the Information Retrieval (IR) process (Marie-Francine, 2006). The main difference between the two is that IE provides exactly the desired information, while IR is in charge of finding the documents in which the desired information should appear. Some newer technologies try to merge the advantages of both, such as web wrappers (XWRAP (Liu et al., 2000) or (Baumgartner et al., 2001)) that extract information from HTML documents and search for answers (automatic responses to specific queries). In this sense, a wrapper is a program that retrieves information from different repositories, merging and unifying it. The aim of a wrapper is to locate relevant information in semi-structured data and put it into a self-described representation for further processing (Kushmerick et al., 1997).


3. Approach

Most representation languages for CPGs are very powerful and complex. They can contain many different types of information and data. The main goal of applying Information Extraction to CPG documents is to obtain the relevant text by means of natural language patterns, which can then be used in the formalization of the CPG. This approach is illustrated in Figure 1.

The approach facilitates the formalization process by using several intermediate representations that are obtained by stepwise procedures. The idea is to obtain an intermediate representation of a CPG in XML format for reasoning. This intermediate representation takes into account the most important pieces of the CPG (such as actions, processes, sequences, etc.). The final output is an XML representation. This approach is an extension and adaptation of the work carried out by Cem Akkaya (Akkaya, 2005), which is a basic method for IE. The initial idea is to enhance the performance of a preliminary prototype that matches patient data against CPGs (Pech-May et al., 2009) within a more general Medical Decision Support System.

To perform the extraction, specific templates have been generated, which are filled with the desired information. A heuristic method is applied to detect this information. The filled templates are processed later.

Figure 1.

Proposed IE approach.

Figure 2.

General structure for CPGs.

The input CPGs are chosen from the National Guideline Clearinghouse (NGC) repository in XHTML format. Then, the XHTML documents are analyzed to extract relevant information and subsequently to obtain the intermediate representation. This representation is displayed through templates in the form of views by means of the Prefuse tool (Jeffrey et al., 2005); the Prefuse toolkit is a set of software tools for creating interactive data visualizations in the Java language. The approach assumes that the tested CPGs follow the structure of the NGC repository, since these CPGs have a predefined structure. The approach works only for textual CPGs: graphical or chart representations include little text, and relevant text (from medical experts) is mandatory for the approach. In general, a CPG has a structure consisting of separate sections for the treatment of a disease. For example, Figure 2 shows the general sections for diagnosis and treatment of a disease.

A general flowchart for a CPG is depicted in Figure 3, based on the general structure of a CPG.

In order to obtain information from CPGs, this approach relies on multiple transformation processes that follow three heuristic pattern levels for Information Extraction:

  1. Phrase pattern level (lexical level)

  2. Sentence pattern level (syntactic level)

  3. Speech pattern level (semantic level)

A parsing process is necessary to obtain the extraction rules. This is based on a knowledge engineering approach considering syntactic and semantic restrictions, and taking into account delimiters.

To process a large number of documents and a large amount of information, specific heuristics are necessary for each type of information required, for example:

  1. Different types of information, in which each type of information needs specific methods for its processing (e.g. processes, parameters).

  2. Different representations of information, in which it should be taken into account that the information could be represented in different ways (structured, semi-structured, or plain text).

  3. Different types of guidelines, in which there may be CPGs for different diseases, diverse user groups, and several organizations that may contain similar CPGs.

Figure 3.

General flowchart for CPGs

The core of the approach is the atomic approach (Appelt & Israel, 1999), whose basic idea is to assume that every noun phrase and verb of the right type, independently of the syntactic relations among them, indicates an event/relationship of interest. It does not take into account the accuracy of the data extraction. The extracted data are then segmented and filtered to clean them. In this way, only the data concerning the information of interest (diagnosis, treatment, drugs, etc.) are obtained. These data are stored in specific templates for further processing. The medical terms used in our prototype come from the Medical Subject Headings (MeSH) of the United States National Library of Medicine. Next, each heuristic pattern is briefly described.

3.1. Phrase pattern level

In this stage a lexical parser is used. It is responsible for splitting the text into paragraphs and tokens and for identifying important phrases in the CPG (e.g. administration of a drug, surgical procedures, dose of a drug, etc.). The function of the lexical analyzer is to identify the relevant information and then extract important data from the CPG. It also filters the information that can be used by the second level of IE (the sentence pattern level, i.e. the syntactic level). The phrase patterns are defined by regular expressions such as:

  • Action terms (mainly verbs; e.g., “activate”, “perform”, “prescribe”, “treat”, “integrate”, “receive”, etc).

  • Condition terms (regular expressions describing a condition, such as “if [, : \.]+”, “in case(s)? [, : \.]+”, “if [_,2 weeks]+”, etc.).

  • Time Annotations (e.g. [ESS, LSS], [EFS, LFS], [MinDu, MaxDu], REFERENCE)

  • Dose unit terms (e.g. “(m|d|c)?(l|g)(/kg/day)?”, “drop(s)?”, “teaspoon(s)?”, “tsp”).
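As an illustration of the term classes listed above, the following is a minimal Python sketch (the prototype itself was written in Java). The pattern set, the labels, and the function name `tag_phrases` are assumptions made for this example; in particular, the condition pattern assumes a negated character class that stops at the next delimiter, in the style of the patterns in Table 4.

```python
import re

# Hypothetical phrase-level patterns modelled on three of the term classes above.
PATTERNS = {
    "action": re.compile(
        r"\b(?:activate|perform|prescribe|treat|integrate|receive)(?:s|d|ed)?\b",
        re.IGNORECASE),
    # Assumption: a condition runs from its trigger word to the next delimiter.
    "condition": re.compile(r"\b(?:if|in cases?)\b[^,:.]+", re.IGNORECASE),
    "dose-unit": re.compile(
        r"\b(?:(?:m|d|c)?(?:l|g)(?:/kg/day)?|drops?|teaspoons?|tsp)\b",
        re.IGNORECASE),
}


def tag_phrases(text):
    """Return (label, matched_text) pairs in order of appearance in the text."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group(0).strip()))
    return [(label, phrase) for _, label, phrase in sorted(found)]


tagged = tag_phrases("If the child is allergic, prescribe amoxicillin 80 mg/kg/day.")
```

Here `tagged` holds one condition phrase, one action term, and one dose unit; a real lexical level would need a much larger vocabulary than these few terms.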

3.2. Sentence pattern level

In this level (syntactic level), the entire document is parsed and split into sentences. Every sentence is then processed with regard to its context within the document and its group affiliation. The context is obtained from captions (e.g. “Acute Pharyngitis in children Algorithm Annotations | Treatment | Recommendations:”), and a group contains the sentences from the same paragraph or the same list, provided there are no sublists. Each sentence is then checked for relevance by looking for useful medical terms and keywords that identify medical actions. These words or groups of words are mainly verbs indicating the application of a therapy, the administration of a drug, or a surgical procedure. This level considers two groups of patterns:

  • Free text pattern. It is used to identify paragraphs from a list of items. The pattern indicates therapy instruments (surgical procedures) combined with key terms (e.g. prescribe, indicate, execute, etc).

  • Concise text pattern. It is used to detect specific defined patterns such as lists of items with incorrect grammar. In general, it denotes the right therapy to apply, instruments for the therapy or drugs. These can be merged with other detected labels in the sentence or phrase pattern level.

Detecting relevant sentences is a challenging task, which is undertaken in two steps:

  1. detecting irrelevant sentences to exclude them from further processing, and

  2. detecting relevant sentences.

In both steps, special keywords are used to detect whether a sentence is irrelevant or relevant. Keywords describing irrelevant sentences are “history”, “diagnosis”, “criteria”, “symptom”, “clinical assessment”, “risk factor”, “complicating factor”, “etiology”, and so on. These terms point out that the following paragraph does not describe treatment processes, but that it describes symptoms, demonstration of diagnoses, and so on. If such a term appears within a caption the corresponding section is removed.
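The two-step detection described above can be sketched as follows. This is a simplified Python illustration, not the prototype's code: the sentence splitter is naive, the relevant-keyword list is a small assumed sample taken from the key terms mentioned in this section, and the function names are hypothetical.

```python
import re

# Caption keywords that mark a section as irrelevant (from the list above).
IRRELEVANT_CAPTION_KEYWORDS = (
    "history", "diagnosis", "criteria", "symptom", "clinical assessment",
    "risk factor", "complicating factor", "etiology",
)
# Assumed sample of keywords that mark a sentence as relevant.
RELEVANT_SENTENCE_KEYWORDS = ("prescribe", "indicate", "execute")


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


def relevant_sentences(sections):
    """sections: list of (caption, paragraph); returns (caption, sentence) pairs."""
    kept = []
    for caption, paragraph in sections:
        # Step 1: drop whole sections whose caption contains an irrelevant keyword.
        if any(k in caption.lower() for k in IRRELEVANT_CAPTION_KEYWORDS):
            continue
        # Step 2: keep only sentences that contain a relevant keyword.
        for sentence in split_sentences(paragraph):
            if any(k in sentence.lower() for k in RELEVANT_SENTENCE_KEYWORDS):
                kept.append((caption, sentence))
    return kept


sections = [
    ("Diagnosis | Symptoms", "Fever is common. Pain may occur."),
    ("Treatment | Recommendations",
     "Prescribe amoxicillin for ten days. Parents should be informed."),
]
result = relevant_sentences(sections)
```

Plain substring matching is deliberately crude here; for instance, a caption that merely mentions "diagnosis" in a guideline title would also be dropped, so a practical version would need to distinguish section captions from document titles.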

3.3. Speech pattern level

In this level semantic aspects are resolved, and the design and structure of the final XML document are improved. In addition, this level is used to categorize sentences and actions and to find their relationships. To accomplish the latter task, the following processes in the CPG are identified:

  1. processes with temporal dependencies (processes that at some point depend on another process),

  2. sequential processes (processes that are required to run with the authorization of another),

  3. processes containing a thread,

  4. selection processes, and

  5. recurring processes.

The application of the extraction rules gives as a result a well-structured XML document which can be represented by using specific templates or graphical forms (by using the Prefuse tool):

  • Templates. These are the final representation of the Information Extraction module and can be filled with specific CPG data. After all the relevant phrases have been collected from the CPG, the document is generated by using the representation of sentences, actions, relationships, and hierarchical structure. The representation is done through a markup document listing all relevant sentences together with an identification of MeSH terms. Thus the document contains information for a dose, duration, actions, administration of a disease, etc.

  • Graphical representation. The generated document is represented visually using Prefuse. In this way, it provides data to optimize table structures, tree graph design, visual encoding techniques, dynamic queries, integrated search, and database connectivity.
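To make the template output more concrete, the sketch below builds one action element with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`action`, `delta-link`, `description`, `agent`, `MeSH`) follow the fragment shown in Table 7; the helper `build_action` and its signature are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET


def build_action(action_id, link_id, description, agents, annotation=None):
    """Build an <action> element; agents is a list of (mesh_id, name) pairs."""
    action = ET.Element("action", {"id": str(action_id)})
    ET.SubElement(action, "delta-link", {"link-id": str(link_id)})
    ET.SubElement(action, "description").text = description
    for mesh_id, name in agents:
        ET.SubElement(action, "agent", {"MeSH": mesh_id, "name": name})
    if annotation:
        ET.SubElement(action, "annotation").text = annotation
    return action


action = build_action(
    8, 8,
    "Standard dose amoxicillin may be considered as initial therapy.",
    [("D000658", "amoxicillin")],  # MeSH identifier as listed in Table 7
)
xml_text = ET.tostring(action, encoding="unicode")
```

Serializing with `ET.tostring` yields a well-formed fragment that a later stage (for example, a visualisation or a manual Asbru translation) could read back with `ET.fromstring`.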


4. Experiments

A first implementation of this approach was developed in Java. For the performance analysis, CPGs for the following diseases were employed:

  • Diagnosis and treatment of otitis media in children

  • Diagnosis and treatment of diabetes-mellitus (type 2)

  • Acute pharyngitis in children

  • Diagnostics and treatment of jaundice

  • Management and treatment of dengue hemorrhagic fever at first and second attention level

  • Treatment of breast cancer

  • Chronic cough in a child

The CPGs were divided into two groups:

  1. Clinical guidelines to develop and improve the heuristics

  2. Clinical guidelines to test the obtained heuristics

The choice of these groups is not trivial, because the organizations that develop CPGs do not consistently follow the same hierarchical structure. In this experiment, the complexity of the hierarchical structure was used as the selection criterion, and complex CPGs were distributed evenly between the groups. Before applying the heuristics, some pre-processing was carried out to verify that the XHTML documents satisfy the structuring elements. This is achieved through the conversion of paragraphs/sections from the CPG and their corresponding items (according to the three pattern levels). Our test considered the following two tasks:

  • Task 1: detection of relevant sentences, and

  • Task 2: summarization of the detected sentence types and the relationships between processes.

The performance of the prototype was evaluated by using the precision and recall measures. The recall score measures the ratio of correct information extracted from the texts against all the available information present in the text. The precision score measures the ratio of correct information that was extracted against all the information that was extracted (Lehnert et al., 1994). The following summarizes the obtained results by task:

  • Task 1: It obtains promising results (Table 1), even if it means lowering the precision score. The lower recall score implies that the detection of relevant sentences has to be improved. The high precision score shows that few irrelevant sentences were classified as relevant.

  • Task 2: The input for Task 2 (Table 2) consists of the sentences identified with a very high score in the previous task. The recall score is very high, which means that only a few sentences were missed. The precision score implies that some slots were filled incorrectly. The reason is that the heuristics do not always detect the correct type of sentence, especially when assigning annotations to their particular actions, which has to be improved.

Table 3 presents an overall evaluation. For all the tables, the nomenclature for columns is:

COR: Number of correct slots that were identified by our IE system

MAT: Total number of slots that match a CPG template in the CPGs group

IDE: Total number of slots that were identified by our IE system

REC: Recall of our system, given by COR/MAT

PRE: Precision of our system, given by COR/IDE
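As a quick check of how the two scores relate to the counts, the following Python sketch computes recall as correct slots over all available slots (COR/MAT) and precision as correct slots over all extracted slots (COR/IDE), which matches the prose definitions of the measures and the tabulated values. The function names are hypothetical; the counts are those of the "Acute pharyngitis in children" row of Table 1.

```python
def recall(cor, mat):
    """Share of the available (template-matching) slots that were found correctly."""
    return cor / mat if mat else 0.0


def precision(cor, ide):
    """Share of the slots identified by the system that are correct."""
    return cor / ide if ide else 0.0


# Counts for "Acute pharyngitis in children" in Table 1: COR=9, MAT=12, IDE=10.
rec = recall(9, 12)
pre = precision(9, 10)
```

With these counts, `rec` is 0.75 and `pre` is 0.9, the values reported in the table.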

At the phrase pattern level several regular expressions were necessary. Figure 4 shows a fragment of a pattern at this level. At the sentence pattern level the free text patterns were identified, such as <p> and </p> (to identify the paragraphs), <li> and </li> (to identify lists of items), and some additional labels. These labels are combined with the labels from the phrase pattern level, such as <dosage> and </dosage>, <dose> and </dose>, etc.
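A minimal sketch of how paragraph and list-item labels could be read from an XHTML guideline is given below, using Python's standard `xml.etree.ElementTree`. It assumes well-formed XHTML without undeclared entities (real guideline pages may need a more tolerant parser), and the function name `text_groups` is hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def text_groups(xhtml):
    """Return (tag, text) pairs for every <p> and <li> element, in document order."""
    root = ET.fromstring(xhtml)
    groups = []
    for elem in root.iter():
        tag = elem.tag.replace(XHTML_NS, "")
        if tag in ("p", "li"):
            # Collapse whitespace; note that nested sublists are not separated here.
            text = " ".join("".join(elem.itertext()).split())
            if text:
                groups.append((tag, text))
    return groups


sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    "<p>Antibiotic treatment is recommended.</p>"
    "<ul><li>Amoxicillin, high dose</li><li>Augmentin</li></ul>"
    "</body></html>"
)
groups = text_groups(sample)
```

Each pair keeps the originating tag, so later stages can treat paragraph sentences and list items as separate groups.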

After each CPG was processed in the three analysis stages (phrase level, sentence level, and speech level), an intermediate representation was obtained. For this, two files were generated, the first one containing the list of relevant sentences (see Table 5) and a second one which is a mark-up document (see Table 6).

The intermediate representation shown in Table 6 contains a set of actions and relations. An action contains the sentences describing the action and the annotation assigned by means of the DELT/A tool. It also contains the instrument for the treatment and an identifier within the MeSH dictionary. If the information is about a dose, duration of treatment, or drug management, then a corresponding MeSH identifier is assigned to it. Table 7 partially shows the actions and their assigned MeSH identifiers for the CPG “Diagnosis and treatment of otitis media in children”.

| CPG | COR | MAT | IDE | REC | PRE |
|---|---|---|---|---|---|
| Diagnosis and treatment of otitis media in children | | | | | |
| Diagnosis and treatment of diabetes-mellitus (type 2) | | | | | |
| Acute pharyngitis in children | 9 | 12 | 10 | 0.75 | 0.9 |
| Diagnostics and treatment of jaundice | 53 | 56 | 53 | 0.946 | 1 |
| Management and treatment of dengue hemorrhagic fever at first and second attention level | 65 | 68 | 75 | 0.955 | 0.866 |
| Treatment of breast cancer | 56 | 59 | 74 | 0.946 | 0.756 |
| Chronic cough in a child | 23 | 31 | 26 | 0.741 | 0.884 |

Table 1.

Evaluation of Task 1 for each CPG.

| CPG | COR | MAT | IDE | REC | PRE |
|---|---|---|---|---|---|
| Diagnosis and treatment of otitis media in children | | | | | |
| Diagnosis and treatment of diabetes-mellitus (type 2) | | | | | |
| Acute pharyngitis in children | 56 | 58 | 67 | 0.965 | 0.835 |
| Diagnostics and treatment of jaundice | 36 | 42 | 45 | 0.857 | 0.8 |
| Management and treatment of dengue hemorrhagic fever at first and second attention level | 73 | 75 | 75 | 0.973 | 0.973 |
| Treatment of breast cancer | 86 | 116 | 98 | 0.741 | 0.877 |
| Chronic cough in a child | 14 | 18 | 20 | 0.777 | 0.7 |

Table 2.

Evaluation of Task 2 for each CPG.

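| Task | COR | MAT | IDE | REC | PRE |
|---|---|---|---|---|---|
| Task 1 | 355 | 443 | 432 | 0.801 | 0.821 |
| Task 2 | 849 | 865 | 978 | 0.981 | 0.868 |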

Table 3.

Overall evaluation results

<number> ([\d]+(([\.]([\d]+))|((\s*[\d]+)?/[\d]+))?)
<numberOrRange> <number>(((_to_)|(\s*-\s*))<number>)?
<time-unit> (m(illi)?)?sec(ond)?(s)?|min(ute)?(s)?|hour(s)?|day(s)?|week(s)? ...
<dose-unit> (m|c|d)?(l|g)(/kg(/<time-unit>)?)?|drop(s)?|tab(s)? ...
<dosage> <numberOrRange>[\s]*<dose-unit>
<time> <numberOrRange>[\s]*<time-unit>
<iteration> TID|BID|QD|(Q|every) <time>|<numberOrRange>_(times|doses)_(per|a)_<time-unit>
<person> those|patient(s)?|person(s)?|child(ren)? ...
<condition> (in_(case(s)?|areas)|if|unless|who(m)?)_[^,:]+ | In_.*allergic [^,\.:]+ | (for|in)_(a_)?(<person>) [^,\.:]+

Table 4.

Examples of phrase level patterns.
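The patterns in Table 4 refer to one another by name, so they must be expanded before they can be compiled. The Python sketch below shows one possible way to do this for a subset of the rules. It rests on several assumptions: the "_" in the table is read as whitespace, redundant brackets are simplified, the trailing "..." alternatives are omitted, and the `RULES` dictionary and `expand` function are hypothetical names.

```python
import re

# Subset of the Table 4 rules; "_" in the table is written here as \s+ (assumption).
RULES = {
    "number": r"\d+(?:\.\d+|(?:\s*\d+)?/\d+)?",
    "numberOrRange": r"<number>(?:(?:\s+to\s+|\s*-\s*)<number>)?",
    "time-unit": r"(?:m(?:illi)?)?sec(?:ond)?s?|min(?:ute)?s?|hours?|days?|weeks?",
    "dose-unit": r"(?:m|c|d)?(?:l|g)(?:/kg(?:/<time-unit>)?)?|drops?|tabs?",
    "dosage": r"<numberOrRange>\s*<dose-unit>",
    "time": r"<numberOrRange>\s*<time-unit>",
}


def expand(name):
    """Recursively replace every <rule> reference with the referenced pattern."""
    return re.sub(
        r"<([\w-]+)>",
        lambda m: "(?:" + expand(m.group(1)) + ")",
        RULES[name],
    )


DOSAGE = re.compile(expand("dosage"), re.IGNORECASE)
TIME = re.compile(expand("time"), re.IGNORECASE)

dosage = DOSAGE.search("Amoxicillin, high dose (80 to 90 mg/kg/day)").group(0)
duration = TIME.search("for 10 days").group(0)
```

Wrapping each expanded reference in a non-capturing group keeps alternations such as the unit lists from leaking into the surrounding pattern. The sketch adds no word boundaries, so a production version would need them to avoid partial matches inside longer words.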

<delta-link link-id="8"/>
<description>In children with risk factors for Streptococcus pneumoniae,
it is recommended that Amoxicillin, high dose (80 to 90
mg/kg/day) or Augmentin (with high dose amoxicillin component)
be utilized as first-line therapy (Nash and Wald, 2001 [S];
Wald, Chiponis, and Ledesma-Medina, 1986 [B]; Nelson, Mason,
and Kaplan, 1994 [C]; Dowell et al., 1999 [E]; Dowell, 1-1998
[E]; Friedland and McCracken, 1994 [E]; Local Expert Consensus
<delta-link link-id="9"/>
<description>Note: Failure with amoxicillin is likely to be due to resistant
Streptococcus pneumoniae, Haemophilus influenzae, or Moraxella
<delta-link link-id="10"/>
<description>High dose amoxicillin will overcome Streptococcus pneumoniae
resistance (changes in penicillin-binding proteins)
(Dowell et al., 1999 [E]; Whitney et al., 2000 [D]).

Table 5.

Fragment of the relevant sentence file corresponding to the CPG “Diagnosis and treatment of otitis media in children”.

<a id="delta:8">In children with risk factors for Streptococcus
pneumoniae, it is recommended that Amoxicillin, high dose
(80 to 90 mg/kg/day) or Augmentin (with high dose
amoxicillin component) be utilized as first-line therapy
(Nash and Wald, 2001 [S]; Wald, Chiponis, and Ledesma-
Medina, 1986 [B]; Nelson, Mason, and Kaplan, 1994 [C];
Dowell et al., 1999 [E]; Dowell, 1 -1998 [E]; Friedland
and McCracken, 1994 [E]; Local Expert Consensus [E]).
<ul type="disc">
<a id="delta:9">Note: Failure with amoxicillin is likely to be due
to resistant Streptococcus pneumoniae, Haemophilus
influenzae, or Moraxella catarrhalis.
<a id="delta:10">High dose amoxicillin will overcome Streptococcus
pneumoniae resistance (changes in penicillin-
binding proteins) (Dowell et al., 1999 [E]; Whitney
et al., 2000 [D]).
The clavulanic acid component of Augmentin is active against
Resistant Haemophilus influenzae and Moraxella catarrhalis (B-
lactamase enzyme) (Wald, Chiponis, and Ledesma-Medina, 1986 [B];
Dagan et al., 2000 [A]).

Table 6.

Fragment of the mark-up document file corresponding to the CPG “Diagnosis and treatment of otitis media in children”.

With the actions obtained from a CPG, it is possible to transform the CPG into an Asbru document. At present, this step is carried out manually.

In Asbru, a plan is represented by means of plan definitions. A plan contains a plan name, arguments, a knowledge role, and a plan body. Table 8 shows an example of a fictitious plan following the Asbru specification. Table 9 (continued in Table 10) shows fragments of the sentences, actions, and plans for the CPG “Diagnosis and treatment of otitis media in children” in Asbru.

<action id="8" parent="5" group="18" selection="0">
<delta-link link-id="8"/>
<description>In the child with no risk factors for penicillin-resistant Streptococcus
pneumoniae standard dose amoxicillin or Augmentin (with standard
dose Amoxicillin component) may be considered as initial therapy.</description>
<agent MeSH="D000658" name="amoxicillin"/>
<agent MeSH="D019980" name="Augmentin"/>
<item>In the child with no risk factors for penicillin-resistant Streptococcus
<annotation>Note: Forty-six percent of isolates at Children’s Hospital Medical
Center of Cincinnati, Ohio have intermediate or high
Penicillin-resistant Streptococcus pneumoniae and local data
supports that 15% of children locally may fail initial therapy
with standard dose amoxicillin.</annotation>
<delta-link link-id="9"/>
<item>Antibiotic Treatment</item>

Table 7.

Partial actions corresponding to the CPG “Diagnosis and treatment of otitis media in children”.
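To make the structure of an extracted action concrete, the following minimal sketch assembles an element similar to the one in Table 7 with Python's standard `xml.etree.ElementTree` module. The element and attribute names are taken from the table; the function `build_action` and the order of the child elements are assumptions made for illustration, not the prototype's actual code.

```python
import xml.etree.ElementTree as ET


def build_action(action_id, parent, group, link_id, description, agents, annotations=()):
    """Assemble one <action> element; the child layout is assumed from Table 7."""
    action = ET.Element(
        "action", {"id": str(action_id), "parent": str(parent), "group": str(group)}
    )
    # Link back to the sentence in the marked-up guideline document.
    ET.SubElement(action, "delta-link", {"link-id": str(link_id)})
    ET.SubElement(action, "description").text = description
    # Each agent carries a MeSH descriptor and a drug name.
    for mesh, name in agents:
        ET.SubElement(action, "agent", {"MeSH": mesh, "name": name})
    for note in annotations:
        ET.SubElement(action, "annotation").text = note
    return action


action = build_action(
    8, 5, 18, 8,
    "In the child with no risk factors for penicillin-resistant Streptococcus "
    "pneumoniae standard dose amoxicillin or Augmentin (with standard dose "
    "Amoxicillin component) may be considered as initial therapy.",
    agents=[("D000658", "amoxicillin"), ("D019980", "Augmentin")],
)
xml_text = ET.tostring(action, encoding="unicode")
```

Keeping the `parent` and `group` attributes on every action is what later allows the actions to be arranged into a hierarchy of plans.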

TIME ANNOTATION    ([ , ], [ ,24 hours], [ , ], *NOW*)
PREFERENCES        Select-method: exact-fit
INTENTIONS         Avoid intermediate state: (glucose-level = high)
CONDITIONS         Abort-condition: (glucose-level = high)
                   Filter-condition: ((patient-age > 60) AND (patient-age < 80))
EFFECTS            Plan-effect: Parameter="glucose-level"
                   Likelihood 0.65
PLAN_BODY          Parallel subplans:
                   Continuation specification: (treatment-1 OR treatment-2)

Table 8.

Example of a fictitious plan in Asbru
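The two conditions of the fictitious plan in Table 8 can be read as simple predicates over patient data. The sketch below is only an illustration of that reading, assuming that a filter condition must hold for the plan to be applicable and that an abort condition stops a running plan. The `Patient` class and both functions are hypothetical and do not belong to any Asbru interpreter.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    """Hypothetical patient record holding only the parameters used in Table 8."""
    age: int
    glucose_level: str


def passes_filter(patient):
    # Filter-condition: ((patient-age > 60) AND (patient-age < 80))
    return patient.age > 60 and patient.age < 80


def must_abort(patient):
    # Abort-condition: (glucose-level = high)
    return patient.glucose_level == "high"


eligible = Patient(age=70, glucose_level="normal")
too_young = Patient(age=60, glucose_level="normal")
```

With these definitions the plan would be considered for the first patient but not for the second, because both age bounds are strict.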

<treatment title="Diagnosis and treatment of otitis media in children.">

<delta-link link-id="14"/>
<description>Therapeutic (10 day)
course of antibiotics.</description>
<delta-link link-id="15"/>
<description>Consideration may be given to a shortened course of antibiotics (5 days) for children who are at low risk (i.e., age > 2 years, no history of chronic or recurrent otitis media and intact tympanic membranes).</description>
<delta-link link-id="16"/>
<delta-link link-id="17"/>
<description>amoxicillin (40
mg/kg/day) if
low risk (> 2 years, no day care, and
no antibiotics for the past three
<delta-link link-id="18"/>
<description>80 mg/kg/day if not
low risk or for resistant AOM
if the lower dose
was used initially.</description>

<treatment title="Diagnosis and treatment of otitis media in children.">
<action group="3" id="1" parent="0">

<action group="9" id="14" parent="12">
<delta-link link-id="14"/>
<description>Therapeutic (10 day)
course of antibiotics.</description>
<agent MeSH="D000900"
<duration term="10 day"/>
<agent MeSH="D000900"
<duration term="5 days"/>
<annotation>Consideration may be
given to a shortened course of
antibiotics (5 days) for children
who are at low risk
(i.e., age &gt; 2 years, no history of
chronic or recurrent otitis media
and intact tympanic membranes).</annotation>
<delta-link link-id="15"/>
<annotation>The use of nasal
decongestants and corticosteroids
is not supported in the literature.</annotation>
<delta-link link-id="34"/>


Table 9.

Fragments for the CPG “Diagnosis and treatment of otitis media in children”: a) fragment of sentences, b) fragment of extracted actions; continued in Table 10.

<plan name="PLAN_PARENT_2"
title="Therapeutic (10 day) course of antibiotics.">
<setup-precondition confirmation-required="yes"> <none/>
<plan-activation> <plan-schema name="PLAN_PARENT_1">
<delta-link link-id="1"/>
<plan name="PLAN_14" title="Therapeutic (10 day) course of antibiotics.">
<delta-link link-id="14"/> <delta-link link-id="15"/>
<delta-link link-id="34"/>
<explanation text="Consideration may be given to a shortened course of antibiotics (5 days) for children who are at low risk (i.e., age &gt; 2 years, no history of chronic or recurrent otitis media and intact tympanic membranes). The use of nasal decongestants and corticosteroids is not supported in the literature."/>
<subplans type="unordered"> <wait-for> <all/> </wait-for>
<plan-activation> <plan-schema name="PLAN_16">
<delta-link link-id="16"/>



Table 10.

Continuation of Table 9: c) fragment of the transformation of actions into the Asbru format.
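Although the transformation into Asbru is performed manually in this work, the correspondence between an extracted action and a plan element can be sketched in code. The example below is a hypothetical automation, assuming that the plan name is derived from the action identifier, that the title is the action description, and that the annotations are concatenated into the explanation text, as the fragments in Tables 9 and 10 suggest. The function `action_to_plan` and the well-formed input fragment are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Illustrative, well-formed action fragment modelled on Table 9 b).
ACTION_XML = """
<action group="9" id="14" parent="12">
  <delta-link link-id="14"/>
  <description>Therapeutic (10 day)
  course of antibiotics.</description>
  <annotation>The use of nasal decongestants and corticosteroids
  is not supported in the literature.</annotation>
  <delta-link link-id="34"/>
</action>
"""


def normalise(text):
    """Collapse line breaks and repeated spaces into single spaces."""
    return " ".join((text or "").split())


def action_to_plan(action):
    """Map one <action> element to a <plan> element (assumed mapping)."""
    plan = ET.Element("plan", {
        "name": f"PLAN_{action.get('id')}",
        "title": normalise(action.findtext("description", default="")),
    })
    # Keep every link to the original guideline text.
    for link in action.findall("delta-link"):
        ET.SubElement(plan, "delta-link", {"link-id": link.get("link-id")})
    notes = [normalise(note.text) for note in action.findall("annotation")]
    if notes:
        ET.SubElement(plan, "explanation", {"text": " ".join(notes)})
    return plan


plan = action_to_plan(ET.fromstring(ACTION_XML.strip()))
```

The sketch covers only the name, title, links and explanation. Preconditions, subplans and plan activation, which also appear in Table 10, would still have to be supplied by the modeller.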

The above actions can be displayed as a tree graph by using the Prefuse tool. This view helps the clinical staff to identify easily which symptoms of the patient determine the dose of a drug or the right therapy. Figure 4 shows a small fragment of a visual plan for the CPG “Diagnosis and treatment of otitis media in children”.
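The hierarchy that underlies such a tree view follows from the `id` and `parent` attributes of the actions. The following sketch shows one way to rebuild it in plain Python, without Prefuse (which is a Java toolkit). The identifiers 1, 5, 8, 12 and 14 appear in the tables above, but the parents assigned to actions 5 and 12 and all the short titles are assumptions made for the example.

```python
from collections import defaultdict

# (action id, parent id, short title); partly hypothetical sample data.
ACTIONS = [
    (1, 0, "Diagnosis and treatment"),
    (5, 1, "Antibiotic treatment"),
    (8, 5, "Standard dose amoxicillin"),
    (12, 1, "Course of therapy"),
    (14, 12, "Therapeutic (10 day) course of antibiotics"),
]


def render_tree(actions, root=0):
    """Return the action hierarchy as indented text lines, depth first."""
    titles = {action_id: title for action_id, _, title in actions}
    children = defaultdict(list)
    for action_id, parent_id, _ in actions:
        children[parent_id].append(action_id)

    lines = []

    def walk(node, depth):
        for child in children.get(node, []):
            lines.append("  " * depth + f"{child}: {titles[child]}")
            walk(child, depth + 1)

    walk(root, 0)
    return lines


tree_lines = render_tree(ACTIONS)
```

Each line of `tree_lines` is indented by the depth of the action, so the text output mirrors the parent and child structure that the graphical view displays.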


5. Conclusions

This paper describes a basic Information Extraction approach applied to obtain knowledge from Clinical Practice Guidelines. The final objective of this work is to obtain, by means of an Information Extraction module, an intermediate representation in XML format of the actions in a textual CPG. The approach applies three heuristics using specific expression patterns over the structure of CPG documents. Through the application of generic Information Extraction heuristic rules, a single formatted document is obtained, which contains the lists, sub-lists, and paragraphs from the original CPG. This document is an intermediate knowledge representation in XML format. The extracted information is used to fill individual slot templates, which represent processes and their relationships in a CPG document. The intermediate representation can be translated into two formal representations: 1) the Asbru language (although other languages can be used) and 2) a graph representation built with the Prefuse tool. The aim of the second option is to show the hierarchical structure of the CPG; thus, a physician can see in a graphical view the symptoms and the routes in the CPG along which the patient can be directed to an action or therapy.

Figure 4.

Example of a visual plan for the CPG Diagnosis and treatment of otitis media in children.

To obtain the actions of a CPG, three stages are necessary: a phrase (lexical) stage, a sentence (syntactic) stage, and a semantic stage. The first stage is the phrase pattern level, in which a CPG document is lexically analysed: the document is tokenised, relevant phrases are identified, and important data are detected. This level passes on to the sentence pattern level only the information used at the syntactic level. The second stage is basically a syntactic analyser that uses the relevant phrases or tokens identified at the lexical level. At this level, the medical terms and keywords that identify medical actions are detected. The set of terms consists mainly of verbs denoting the application of a therapy, the administration of drugs, or a surgical procedure. This level is divided into two groups of patterns: free text patterns and concise text patterns. The third stage is the speech pattern level, where the design and structure of the document are enhanced; it categorises sentences and finds their relations. The approach has been implemented in a first prototype. The experiments show that the proposed heuristic-based approach can achieve good results, especially for CPGs with a major portion of semi-structured text. The obtained intermediate representation may be used in a next stage for a better formalisation of the CPG.

As future work, the rules for processing CPGs containing complex information will be improved. Another goal is to create a support model able to evaluate the plans contained in CPGs.



This research was partially funded by project number 153880 from “Fondo Mixto Conacyt-Gobierno del Estado de Tamaulipas” and project number TAB-2010-C19-144199 from "Fondo Mixto Conacyt-Gobierno del Estado de Tabasco".




  • International Classification of Diseases. This is a classification of diseases and procedures used in the coding of clinical information derived from medical assistance, mainly in the hospital environment and specialized medical care centers.
  • The Prefuse toolkit is a set of software tools for creating interactive data visualizations in the Java language.
