Software tools for graph-based analysis of biological networks.

## 1. Introduction

The interactions that occur among different cells and inside themselves are essential for maintaining the fundamental life processes in biological systems. These interactions are generally described as networks interconnecting an enormous number of nodes and are characterised by extremely complex topologies. Among such networks, the most studied are typically classified as metabolic pathways (the flow of material and energy in the organism), gene regulatory cascades (when and which genes are expressed) and protein signalling networks (triggering an immune response to viral attack). The Pathguide resource list offers detailed information on several biological pathway and network databases accessible through the web (http://www.pathguide.org, 2009 - Bader et al., 2006).

Mathematical objects such as graphs, consisting of nodes and links joining the nodes (arcs and edges), offer a convenient and effective level of abstraction for representing the aforementioned networks. These graphs are generally referred to as biological interaction networks.

Biologists can benefit from graph-based formalisms since they are able to visualize the overall (or certain parts) topology of the network, modify nodes and links and annotate them with additional information. Furthermore, many biological questions can be conveniently formulated in terms of search or optimization problems on graphs. For example, it is currently believed that the topological structure of biological networks (e.g. cell signalling pathways) reveal its characteristics of robustness and evolutionary history. The main lines of thought in this field describe biological interaction networks by employing formalisms based on either undirected or directed graphs; we describe them in section 2.

The complexity of biological networks is not only driven by the enormous amounts of variables and the relationships among them; additional sources of uncertainty are due to the difficulty of observing cellular dynamics in 3D-space. Thus, developing quantitative mathematical models of cellular dynamics in large networks may be computationally infeasible. However, structural analysis and hypothesis formulation (and validation) may reveal valuable information on the underlying biological functions at a system level (e.g which molecules and information processing pathways are involved in the accomplishment of a biological function). For this reason, it is essential for biologists working on such networks under this perspective to have access to open and extensible computing platforms with advanced functionalities and efficient graph rendering routines. To this end, we summarise (section 4) the features of an exemplary subset of open source software packages typically used for the exploratory analysis of biological interaction networks of variable type and size.

We conclude the chapter by observing that currently there is a lack of quantitative and comprehensive frameworks for linking network topology to biological function.

## 2. Graph-based formalisms for representing biological interaction networks

Graphs are widely used as a mathematical formalism for representing networks. In this section we review directed and undirected graph-based paradigms used to represent and study biological networks. Directed graphs can well describe causal relationships among entities whereas undirected graphs are better suited for expressing soft constraints between nodes.

### 2.1. Undirected Graphs

A biological interaction network may be represented by an undirected graph, a mathematical structure G(V,E) composed of a set of vertexes V and edges E (

Interesting matters to investigate about BSGs are, for example, determining whether they share the same topological properties of other types of networks (e.g. social or communication networks), their robustness (e.g. against viral attacks or mutations) and if they are self-organizing.

To this end, qualitative approaches based on statistical characteristics of network structure have been widely applied for studying the topology of large-scale networks Albert, R. & Barabasi, A.-L., 2002, Watts, D.J. & Strogatz, S.H., 1998, see Jeong et al., 2001 for a case study of protein-protein interaction networks). These methods try to exploit massive datasets in order to discover patterns that would otherwise be undetectable at smaller scales.

Among the most frequently used and perhaps most significant topological statistics are the average path length *l,* defined as the average number of links along the shortest paths for all possible pairs of network nodes and the average clustering coefficient

where the clustering coefficient*i* provides a quantitative measure of how close to a clique is the subgraph obtained by considering the vertex *i* and all its neighbors*.*

In random networks Erdos, P. & Renyi, A., 1960, an edge connecting two arbitrary nodes is added to the graph as the result of a random experiment (the probability of accepting the edge is equal to a fixed*l* and

Recent studies have revealed that many complex systems present small-world and scale–free properties Albert, R. & Barabasi, A.-L., 2002. A small-world network has a significantly larger

It has been conjectured that a self-organizing mechanism based on the preferential attachment rule can explain the dynamics of scale-free networks. Preferential attachment, which naturally leads to the power-law degree distribution, implies that when a new node is added to the network it is attracted by the vertexes with the highest degree. This is in contrast to random graphs where the addition of nodes follows a pure probabilistic process, meaning that nodes are equally important in a statistical sense (no node is privileged).

Research has shown that the functions of a complex system may be affected to a large extent by its network topology Albert, R. & Barabasi, A.-L., 2002. For example, the greater tendency for clustering in metabolic networks appears related to the organization of functional modules in cells, which contribute to the behaviour and survival of organisms.

In a biological network this behaviour is plausible (and desirable) since the network converges toward a structure that favours the efficiency (low-energy reactions) of its functions (many real world networks ultimately exhibit this structure). Some researchers explain this behaviour as the evidence of evolutionary history; i.e., nodes with high degrees have resisted selective pressure since they are important for the survival of the organism (acting as hubs and thus are essential for the correct functioning of the network). It is this structure that makes such networks tolerant to random failure and errors (e.g. missense point mutations) but less vulnerable to targeted attacks (e.g. removal of nodes with highest degree) Holme et al., 2002, Bollobas & Riordan, 2003. However, while this a posteriori analysis seems consistent with observations of real biological networks there are still many open questions that naturally arise regarding the way they are actually formed.

### 2.2. Directed Graphs

When reactions are also represented as vertexes

BPGs based on bipartite graphs are characterised by the following properties (which immediately follow from the definition of bipartite graph):

the sets

the edge set

the property

The BPG directed graph-based formalism is appealing since:

both molecules and reactions are viewed as vertexes (which typically indicate the information content of the graph while edges carry the information concerning its topological structure) thus making certain types of analysis simpler (e.g. when taking into account the enhancing or inhibiting character of reactions);

the directionality of the reactions and information flows in the network are made explicit;

there is (at least in principle) a greater compatibility with richer (quantitative) models that are able to take into account system dynamics (e.g. Petri nets).

The analysis of a BPG is essentially performed by working on the corresponding adjacency matrix A of size

An alternative representation is given by the adjacency list L; in this case the list of nodes (either molecules or reactions) for which there exists an arc originating from the i-th node are stored in the corresponding position of the array L(i). Graph analysis algorithms are often more efficient when implemented using adjacency lists.

Of notable interest are the sets

The in- and out-degrees of a molecule node are extremely informative since one can identify “source” molecules (with in-degree equal to 0 that act only as substrates) and “sink” molecules (with out-degree equal to 0 that act only as products). Nodes with large in/out-degrees (also called “hub” nodes) are typical of many reaction networks that exhibit the scale-free topology and small-world structure as discussed above.

It is interesting to explore a BPG by searching for the induced subgraph originating from a particular vertex set. The basic use of this algorithm is for identifying the reactions and substrates involved in the transformations between any two (or set of) molecules. In particular, the induced subgraph originating from a certain vertex

Path enumeration algorithms identify the paths originating in a vertex

An interesting line of research was recently opened by (Ruths et al., 2006) who formulated two biologically significant problems on BPGs: the constrained downstream (CDP) and minimum knockout (MKP) problems. The CDP essentially searches a BPG for the set of reactions that leads from one set of species to another, with the constraint that the search algorithm must include and/or exclude a given set of reactions. An application of a CDP instance to a BPG would therefore be useful to design drugs that modify or inhibit certain biological functions while preserving others. From a topological point of view this amounts to identifying molecules or sets of molecules that have to be targeted to inhibit function of a sub-network while preserving connectivity (and thus information flow) to a different sub-network. On the other hand, the MKP seeks for a minimum-size set of molecules whose removal (knocking out) from the BPG disconnects given sets of source and target species thus making impossible the production of certain target molecules from the source set. An instance of the MDP may help solve the biological problem of identifying the optimal and minimal sets of molecules that have to be targeted to block network function. For example, in patients affected by cancer this could allow the development of therapeutics that can efficiently destroy cancer cells while preserving normal cells.

The above algorithms based on a depth-first search strategy usually require running times of

### 2.3. Relationship between directed and undirected graph representations

The transformation of the BPG_{u} may be calculated by the matrix sum B=A+A’ (or A’+A, A’ denotes the transpose of A) where A is the adjacency matrix of the directed graph G.

As an example, following this procedure the BPG depicted in Figure 1-(a) is transformed into the undirected graph shown in Figure 1-(b), which is also bipartite. From a biological point of view, one must be aware that the undirected graph contains additional information since molecules that were reactants are now also products and vice versa; such newly introduced reactions are not necessarily biologically significant.

A further transformation step is illustrated in Figure 1-(c) where the undirected BPG is transformed into a BSR by cutting reaction nodes (all the information contained in the arcs entering and leaving the reaction is retained).

### 2.4. Petri Nets

In computer science, a Petri net (PN) is a graphical and mathematical modeling tool for the analysis and simulation of concurrent and/or stochastic systems Murata, 1989. A system is represented as a bipartite directed graph with a set of "place" nodes that contain resources (conditions), and a set of "event" nodes which generate and/or consume resources (transitions).

A PN models both the topology of the network as well as the dynamical behavior of the system. For this reason they have been employed to simulate biological networks Reddy et al., 1993, Tsavachidou & Liebman, 2002, Nagasaki et al., 2005, Chaouiya, 2007. In translating the biological network into a PN, each molecule and each interaction are described as a node. Each molecule is initialized with some number of “tokens,” which are then iteratively reallocated by the interactions to which they are connected.

## 3. Enhanced formalisms for representing biological networks

In this section we briefly describe graphical formalisms that are commonly used to represent complex biological systems. Although such formalisms are not always strictly graph-based (i.e. directed or undirected graphs), they are worthy of mention since they afford visual modelling approaches that humans can employ for the description, analysis and simulation of complex biological systems by computer programs. In general, with respect to mathematical graphs these notations encompass enriched and highly expressive formalisms that can describe many types of nodes and relationships.

These languages provide a consistent and complete formalism for the specification of biological models and thus facilitate model sharing in different software environments (models should be platform-independent).

It must be emphasized that currently system biology still lacks a standard graphical notation.

### 3.1. SBML

The Systems Biology Markup Language (SBML) is an open XML-based schema for representing quantitative biological models such as regulatory and metabolic networks Hucka et al., 2003, Finney & Hucka, 2003. SBML specifies a formalism for representing biological entities by means of compartments and reacting species, as well as their dynamic behaviour, using reactions, events and arbitrary mathematical rules Hucka et al., 2008. The SBML file format and underlying network structure is widely accepted and supported by a large number of tools (http://sbml.org, 2009).

### 3.2. BioPAX

The Biological Pathway Exchange Format (BioPAX) is a collaborative effort to create a data exchange format for biological pathway data (http://www.biopax.org, 2009). The BioPAX specification provides an abstract object model, based on the web ontology language (OWL), for the representation of biological pathway concepts. Mapping the concepts defined in any custom data model to the semantically corresponding items in the BioPAX model provides a unified formalism for the representation of biological interaction networks.

### 3.3. PSI-MI

The Proteomics Standards Initiative (PSI-MI) provides an XML-based formalism for the representation of protein-protein interaction data

(http://psidev.sourceforge.net/mi/xml/doc/user/, 2009). Future developments are aimed at expanding the language to include other types of molecules (e.g. DNA, RNA, etc).

### 3.4. SBGN

Recently, the level 1 specification of the Systems Biology Graphical Notation (SBGN) was released (http://www.sbgn.org, 2009) in an effort to provide a standardised graphical notation to describe biological systems in a format comprehensible to humans. In this respect, it may be considered a natural complement to SBML and to the aforementioned formalisms.

## 4. Graph-based Analysis Tools

It is essential for biologists working with biological networks to have access to open and extensible computing platforms with advanced functionalities and efficient graph drawing routines. For this purpose there are currently many commercial and open source tools available; a short selection is provided in Table 1 which highlights some essential features.

Tool | Graph Analysis | Graph Editing | Graph Vis ual . | Integration | Supported Formats | License |

Cytoscape (Shannon et al. , 2003) | YES | YES | YES | Web Services | SBML, PSI-MI, BioPAX | Open source |

PIANA (BIANA) ( Aragones et al. , 200 6 ) | NO | YES | YES | Cytoscape plug.in | PSI-MI, BioPAX | Open source |

ONDEX (Kohler et al. , 2006) | YES | YES | YES | SBML, PSI-MI, BioPAX | Open source | |

VisANT ( Hu et al. , 2006) | YES | YES | YES | Open source | ||

GraphWeb (Reimand et al. , 2008) | YES | YES | YES | Cytoscape plug.in | BioPAX | Public web service |

PathSys (BiologicalNetworks) (Baitaluk et al. , 2006) | YES | YES | YES | Open source |

The first column “graph analysis” indicates whether the software tool can perform structural analysis by making use of graph-theoretic algorithms (average path length, clustering, diameter, etc). The column labelled “graph editing” shows that all the tools have graph editing functionalities, i.e. the capability of adding and modifying nodes or edges (although not all packages have graphical editors). All programs implement advanced graph rendering routines that are able to display large networks (column “graph visualization”). The column “supported formats” enumerates the supported notations among those described previously. Finally, the last column (“license”) indicates the conditions of use of the software (some licenses are free only for academic use).

## 5. Conclusions

With the recent advancements of experimental techniques in molecular biology (e.g. microarrays) vast amounts of data are being gathered at an impressive rate and made available in numerous on-line databases. The systematic analysis of this data is currently the objective of many strong-minded research efforts to observe and understand the biological principles that govern the interactions between a variety of molecules at the system level. Such interactions are viewed as networks that often use graph-based notations since the later allow both explorative, statistical and functional analysis (e.g. local and global characteristics, growth dynamics). In addition, graph-based approaches may provide the means for identifying the functional units (that constitute a large network) responsible for the basic life processes in an organism at the molecular level. Such analysis may also explain the (multifactorial) mechanisms by which molecular interaction networks (as a whole or certain subsystems) appear disrupted in many forms of disease.

Although cluster analysis may be able to identify the functional components of a system it is also necessary to study the interactions among subsystems and with the environment. This also applies when evaluating network robustness with probabilistic based formalisms, such as Markov-chain models or stochastic PNs, which capture the time dependent probability of error propagation within a system but are unable to explain the course of events that are caused by interactions with the environment (e.g. transitional events) or the relationship between faults and the functional model (a fault probability equal to zero does not guarantee correct functioning of the system).

Applying highly expressive formalisms (e.g. Petri nets) presents at least two problems: a) exploratory techniques can be overly complex (e.g. NP-hard algorithms, state-space explosion) and b) models are difficult to use and understand since they require skilled users to design and maintain. These are general issues that apply to an entire class of graph-based formalisms used to model large-scale networks. As a result, no technique appears to be suitable for representing an entire system from both the structural and functional points of view. One way to approach these problems is by using multiple formalisms, i.e. applying different modelling techniques to different parts of the system in order to exploit the strengths of one notation where others are less suitable. However, we stress that the major concerns in this area are due to the absence of a widely accepted standard notation for the representation of biological interaction networks and the availability of quantitative models that are able to explain both their structural features and system dynamics.

## Notes

- The nomenclatures BSG, BPG used in this article simply denote that the network is represented respectively by an undirected or directed graph. The BSG formalism is often used to represent protein-protein and protein-DNA interaction networks whilst the BPG is employed to study metabolic networks.