Application of MATLAB in -Omics and Systems Biology

Biological data analysis has dramatically changed since the introduction of high-throughput - omics technologies, such as microarrays and next-generation sequencing. The key advantage of obtaining thousands of measurements from a single sample soon became a bottleneck limiting transformation of generated data into knowledge. It has become apparent that traditional statistical approaches are not suited to solve problems in the new reality of “big biological data.” From the other side, traditional computing languages such as C/C++ and Java, are not flexible enough to allow for quick develop‐ ment and testing of new algorithms, while MATLAB provides a powerful computing environment and a variety of sophisticated toolboxes for performing complex bioinfor‐ matics calculations. We have used MATLAB to develop the pathway signal flow (PSF) algorithm for assessment of pathway activity changes based on high-throughput gene expression and pathway topologies. Additionally, we have created a KEGGParser tool for parsing, editing, and visualizing Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway maps. We have used these tools to obtain a collection of KEGG pathways and evaluate their activity changes in different clinical forms of pulmonary sarcoidosis (PS). The application of PSF provided an extended systems view on pathway deregulation states and implicated several new pathways in sarcoidosis that had not been identified using other analysis approaches.


Introduction
Biology and biomedicine have always been quantitative scientific disciplines and data collection has been an important part of biological knowledge inference. Biological data types are very diverse and their nature has dramatically changed since introduction of new measurement technologies starting from the mid-twentieth century. DNA/RNA/protein sequencing [1,2], X-ray crystallography [3], antibody-based assays [4,5], and various modifications of polymerase chain reaction (PCR) [6,7], allowed for collection of data about various aspects of cell function in normal and diseased conditions. A few of the most common biodata types include but are not limited to: • Sequences-the one-dimensional orderings of monomers (DNA/RNA/proteins).
• Graphs-representation of a set of objects and pairwise interactions between them (pathway maps, protein-protein interaction nets).
• High-dimensional data-each sample is defined by hundreds or thousands of measurements, usually concurrently obtained (e.g., high-throughput gene expression data; see below).
• Geometric information-information about 3D structure of proteins, lipids, and nucleic acids.
• Models-mathematical or visual representations of dynamic or static behavior of biological entities.
• Literature-biological literature itself can be regarded as data to be exploited via data/text mining for revealing findings that would otherwise go undiscovered.
Traditionally, statistics has played significant role in biological data analysis [8]. It is used in biomarker identification [9][10][11], testing new drugs [12], analysis of genetic associations [13,14], understanding associations between the levels of different biomolecules, experiment planning, etc. This has been made possible due to several key factors brought together. First of all, introduction of the "data analysis" concept was made by Tukey [15], implying that there is no need to be a statistician to be able to analyze and interpret the data. Next, the active penetration of computers into various research fields, as well as development of computer software and data analysis packages that can be used by "non-statisticians," has greatly enforced the progress in biomedical data analysis.
A new era of biomedical research has started in the twenty-first century with the advent of massively parallel measurement technologies. Traditional data collection approaches (lowthroughput) are mainly focused on measurement of a few carefully selected parameters from a large number of samples, while high-throughput methods, such as microarrays, nextgeneration sequencing, and proteomics approaches, allow for acquiring dozens or hundred thousand observations from a single sample. Low-throughput techniques still remain an important tool in biomedical research; however, only high-throughput approaches provide global outlook on complex biological processes occurring at cellular, tissue, or even organismal level [16,17]. This breakthrough has also changed the main paradigm in biological data representation, shifting from datasets containing large number of observations with few variables (i.e., attributes or entries in the data vector) to datasets containing more variables than observations (high-dimension, low-sample size data (HDLSS)). HDLSS data is generated via various technologies measuring protein activity/abundance levels, gene expression levels (the amount of transcripts generated from each gene), ribonucleic acids (RNA) abundances, etc. For example, a typical high-throughput measurement of gene expression is performed for about 20,000 genes per subject, while the number of subjects rarely exceeds 10-20.
With this new data type, classical statistical approaches frequently fail to produce meaningful results because they are not designed to cope with this growth of dimensionality in datasets [18][19][20]. Thus, a demand for new algorithms and software for data analysis and visualization has emerged. From the other side, traditional computing languages, such as C/C++ and Java, are not flexible enough to allow for quick development and testing of new algorithms. This is because in contrast to scripting languages they require compilation of the whole code before execution, definition of variable types, etc. The limitations of compiled languages pose the challenge of having professional software engineers (or dedicated persons with programming skills) for writing software. In contrast, scripting languages allow for execution of separate lines of the script, without caring for variable type definitions and memory allocation beforehand. They are commonly at higher language level and additionally are supported by a wide range of packages for specific scientific purposes.
In this sense, MATLAB provides a powerful computing environment and a variety of sophisticated toolboxes for performing complex bioinformatics calculations. In this chapter, we discuss the examples of MATLAB application for high-throughput gene expression data analysis and visualization based on the algorithms and software developed by our research group.

Analysis of gene expression data
Gene expression is the realization of the information stored in genes through synthesis of ribonucleic acids (RNA) and proteins that perform functional and structural activities in a cell and an organism [21]. Genes are DNA fragments that code for products such as proteins and functional RNA, including ribosomal RNA (rRNA), transfer RNA (tRNA), or small nuclear RNA (snRNA). According to the central dogma of molecular biology, protein coding genes are expressed in a two-stage process involving synthesis of a messenger RNA (mRNA) (transcription) and synthesis of a protein on the mRNA template (translation) (Figure 1) [21]. However, in biomedical research the term "gene expression" is often used to refer only to the transcription stage, and the gene expression value usually indicates the amount of mRNA produced from each gene. High-throughput analysis of gene expression is one of the cornerstones of systems biology [16]. The study of gene expression signatures has largely contributed to better understanding of molecular pathology of lung diseases [22,23] and to identification of new disease subclasses/ entities [24]. It has also provided new approaches to diagnostics [25] and helped to suggest novel therapeutic compounds [26].
There are two main techniques that allow for massively parallel measurement of gene expression in cells and tissues: microarrays and next-generation RNA sequencing (RNA-seq). The technology details of these approaches are summarized in Figure 2 and are described in a number of publications elsewhere [2,[27][28][29]. The raw gene expression data for microarray and RNA-seq gene experiments are usually presented in a form of expression matrix. Each column represents all the gene expression levels for a single sample, and each row represents the expression of a gene across all the samples. This matrix serves as the source for subsequent analysis steps.

Figure 2.
A general summary of two main techniques for gene expression assessment. Microarray-based techniques (at the top) are based on hybridization of complementary DNAs, obtained from RNA extracted from the cells, to specially designed oligonucleotide arrays and subsequent capturing of fluorescent signals coming from hybridized probes. The more the signal, the more RNA was produced from the gene corresponding to the respective oligonucleotide. RNA sequencing (at the bottom) utilizes next-generation sequencing technologies to obtain short sequencing reads from RNA extracted from the cells. These short reads are then aligned to the reference genome to determine the corresponding genes and to compute RNA abundance for each gene.
There are many different algorithms, also implemented in MATLAB, that are aimed at analysis of global gene expression [30][31][32]. MATLAB Bioinformatics toolbox demos are excellent start points to become acquainted with microarray and RNA-seq gene expression analysis.
Most of these algorithms exploit a common gene-centered analysis pipeline, which identifies genes differentially expressed between studied conditions, and further annotates the gene lists by assessment of their relative abundance in predefined functional categories available in biological databases [33], such as Gene Ontology (GO) [34], Kyoto Encyclopedia of Genes and Genomes (KEGG) [35], and others. A shortcoming of these approaches is that they overlook the interactions that exist between gene products in a cell. These interactions define the actual behavior of biological systems, along with expression values of the interacting agents. The information on interactions occurring between gene products is depicted in topologies of biological pathways, and thus, gene expression analysis that also accounts for topology information would be more informative about the state of pathways and activities of associated biological processes. Biological pathways are directed and spatially defined sequences of biomolecular physical and regulatory interactions that represent molecular signal propagations leading to functional realizations of biological processes. The behavior of a given pathway highly depends on both gene expression and its topology [36]. It is known that a significant portion of genes may be involved in more than one pathway, while perturbation of a single pathway may be conditioned by differential expression of multiple member genes. Moreover, a single disease may be characterized by orchestrated deregulation of several pathways. Finally, profiles of pathway deregulations may be common for different diseases. Thus, based on gene expression and pathway topology data, it is possible to identify common and specific pathways among diseases.

KEGGParser
KEGGParser is a MATLAB graph-based graphical user interface tool for parsing, editing, visualization, and analysis of biological pathway maps from Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database. It is based solely on functions contained in MATLAB, MATLAB Bioinformatics toolbox version 3.x, and Image Processing Toolbox 2.x. KEGGParser is freely available at http://www.mathworks.com/matlabcentral/fileexchange/37561.
KEGG pathways are stored in a collection of manually curated pathway maps. Each pathway is represented by an image and accompanying xml (KGML) file, which stores an xml tree structure containing information about nodes and edges. The KGML files are created from the map images with "KegSketch" program.
KEGG pathways can be accessed from MATLAB using KEGG REST interface implemented in MATLAB Bioinformatics toolbox. However, KEGG REST functions are very limited and intended for very basic operations, such as retrieval of pathway images, node information, and mapping of gene expression data via coloring of nodes on top of the map images, and are not suited for much wider range of pathway analysis needed. In contrast, KEGGParser uses information stored in pathway KGML files to convert them directly to MATLAB biograph objects (a graph representation of the pathways), which then allows for performing advanced pathway analysis. Biograph is a MATLAB data structure for implementation of directed graphs. Graph nodes can represent diverse biological agents, such as genes and proteins, and edges depict directed interactions between the agents, which can represent physical, regulatory interactions, dependencies, etc. The biograph object also stores graphical properties, such as color, labels, and sizes, used for 2D visualization.
Although the biograph object is ideally suited for storing and manipulating biological pathways, its severe limitation is the absence of generic methods for adding or deleting edges on already created graphs. This can seriously impede downstream analyses, because KGML files do not contain all the information depicted in the pathway map (Figures 3 and 4).  Moreover, it is known that pathways can change their topology (i.e., interactions between nodes in a pathway) due to mutations, regulation of gene expression, , etc. In order to overcome these limitations, we have developed several functions that allow for graph manipulations on pre-calculated graphs. They are accessible from the link http://www.mathworks.com/matlabcentral/fileexchange/37475. The following example and Figure 5 demonstrate the biograph object editing process using graph manipulation tools. The graph manipulation tools are used by KEGGParser for parsing KGML files, creating and editing biograph objects that represent KEGG pathway graphs. Along with creation of pathway graphs, KEGGParser automatically handles and respectively edits part of inconsistencies between KGML files and map images. The overall KEGGParser workflow and usage examples are described in detail in the original publication [37]. In this chapter we present two different use cases for this software.
The first example refers to retrieval, editing, and visualization of KEGG pathways using KEGGParser. As an example we have chosen RIG-I (retinoic acid-inducible gene 1) -like receptor signaling pathway. RIG-I-like receptors and downstream signaling are key elements in sensing viral pathogens and generating innate immune responses [38,39]. Activation of this pathway is essential for production of various cytokines, mediators of inflammation, and proliferation of immune system cells. Deregulation of RIG-I-like pathway activity is implicated in many autoimmune disorders, such as systemic lupus erythematosus and Aicardi-Goutières syndrome [40]. In order to access the basic characteristics of the pathway topology, we used KEGGParser for retrieving and editing the corresponding KGML file. The pathway map from KEGG pathway database, as well as the native graph object parsed using KEGGParser, is presented in Figures 3 and 4 (static image and parsed without automatic corrections). As can be noticed, the parsed graph in MATLAB has many missing edges and unconnected nodes, making subsequent analysis improper. This is because KGML file contains information only about protein-protein interactions and the information about other types of interactions present in the map images is lost after KGML parsing. Using KEGGParser, we manually edited the pathway graph object and restored missing edges (Figure 6).
In order to perform graph theory-based analyses, we stored unedited and edited pathway graphs in rig_like.mat and rig_corr.mat files, respectively (available at [41]). Using graph theory functions implemented in MATLAB Bioinformatics toolbox, we performed some basic comparisons of unedited and manually edited pathway graphs. First we compared connectivities of nodes in the graphs: Analysis of node degree histograms in the unedited graph shows significant skew toward 0degree nodes, compared to the edited graph (Figure 7A and B).
The results showed that there are five strongly connected multi-node components (with four nodes on average) in the unedited pathway graph, while the edited graph identifies six components containing on average seven nodes (Figure 7C and D). These components correspond to nodes that form separate pathway branches and lead to different functional outcomes associated with pathway activation.
Finally, the distributions of all lengths of shortest paths between pairs in the unedited and edited graphs were significantly different, showing longer average path in the edited graph consistent with the static pathway map image (Figure 7E and F). Thus, KEGGParser can facilitate the better representation of biological pathways in MATLAB and contribute to the adequate analyses of pathway topologies. Manual/automatic editing options help to restore correct topologies of the pathways. Using KEGGParser, we have created a collection of signaling biological pathways that were further used for analysis of pathway activity perturbations in various diseases (see pathway signal flow (PSF) section).

Pathway signal flow
The PSF algorithm can be used to assess the changes in activity of a given biological pathway depending on the pathway topology and relative gene expression [42]. In contrast to the traditional gene-centered approaches for expression data analysis, PSF also takes into account the interactions between gene products (i.e., proteins, RNA, etc.) and other biological entities and, thus, provides richer outlook on actual molecular events associated with pathways.
This algorithm calculates how the activating signal is propagated from pathway input nodes, through interactions between intermediate node pairs, to the output nodes, leading to functional realization of associated biological processes. The amount of signal approaching the output nodes is called PSF. The extent of changes in the pathway flow is indicative of the likeliness of the given pathway to be involved in the biological processes underlying the phenotypic differences between the studied conditions. The detailed description of PSF is given in a number of publications [42,43]. Here we will bring an example of PSF usage for analysis of pathway flow perturbations in pulmonary sarcoidosis (PS) and its different clinical forms.
PS is a systemic granulomatous disease with unknown cause [44]. It is characterized by massive influx of activated T-lymphocytes and macrophages into the lungs, formation of granulomas, and lung function failure. The immune disturbances and and granulomatous deposits resolve spontaneously in 60-70% of PS patients; the rest follow a chronic course [44]. Though significant advances have been made in understanding of immunological features of this disease, the central pathomechanisms of its development are yet unknown. In this study, we aimed at identification of differentially deregulated pathways in sarcoidosis, as well as in different clinical forms of the disease, compared with healthy lung. We have used two microarray gene expression datasets from Gene Expression Omnibus public repository (dataset IDs: GDS3580 and GDS3705). The gene expression data and PSF scripts are available for download from [41].
The results of PSF analysis indicate that inflammation-related pathways are significantly upregulated in sarcoidosis compared to healthy lung, while pathways interfering with cell proliferation and fibrosis are downregulated ( Table 1). Moreover, compared to self-limiting PS, the progressive form of the disease is characterized by more prominent deregulation of pathways associated with proinflammatory response and fibrotic conversion of the tissue ( These findings are in compliance with the existing knowledge on sarcoidosis pathogenesis. Numerous experimental studies, including our own, have implicated immune/inflammatory pathways, such as Toll-like receptor signaling, phagocytosis, and chemokine signaling in sarcoidosis [45][46][47]. However, application of PSF provided an extended systems view on pathway deregulation states. Moreover, PSF implicated several new pathways that were not detected using gene-centered analysis approaches in the original publications [48,49].

Conclusions
This chapter demonstrates the advantages of MATLAB for performing bioinformatics research. The powerful scripting language combined with various toolboxes makes it a valuable tool for creation of complete pipelines for -omics data analysis and visualization. We have contributed to MATLAB Bioinformatics toolbox with the PSF algorithm for assessment of pathway activity changes, and KEGGParser for fine-tuned pathway editing and visualization. MATLAB GUI support allows for visualization of the results, making it easy to use for the general scientific audience. Combining our tools with the rest of the MATLAB Bioinformatics toolbox has the power of having various complete pipelines for high-throughput data analyses.
Finally, MATLAB scripting language allows for easy development and testing of various algorithms that can later be translated into other scripting and programming languages. It should be noted that KEGGParser and PSF, originally developed in MATLAB, were then ported to various programming and scripting environments, such as Java and R [43,50].