Open access peer-reviewed chapter

Integrative Systems Biology Resources and Approaches in Disease Analytics

By Marco Fernandes and Holger Husi

Submitted: October 4th 2018 | Reviewed: January 30th 2019 | Published: March 30th 2019

DOI: 10.5772/intechopen.84834


Abstract

Currently, our analytical capabilities are struggling to keep pace with the in-depth analysis of the large-scale data generated by high-throughput omics platforms. While substantial effort has been spent on improving the technical aspects of many omics detection platforms, the development of integrative downstream approaches remains challenging. Systems biology has immense applicability in the biomedical and pharmacological fields, since their main goal is to translate measured outputs into potential markers of human ailments and/or to provide new compound leads for drug discovery. This approach would become more straightforward and realistic to use in standard analysis workflows if all available information on every component of a biological system were collated into a single database framework, instead of searching for and fetching a single component at a time across a scatter of database resources. Here, we describe several database resources and standalone and web-based tools applied in disease analytics workflows based on the data-driven integration of outputs from multi-omic detection platforms.

Keywords

  • systems medicine
  • bioinformatics
  • omics
  • data integration
  • pathway analysis

1. Introduction

Over the last decade, the emergence of high-throughput screening platforms and the increasing availability of large-scale omics data, as well as clinical data from electronic health records comprising phenotypic, therapeutic and environmental information, have opened the possibility of understanding diseases and disease stages mechanistically at the molecular level. As a result, a wealth of data has been generated for many kidney and cardiovascular conditions; however, these findings have been neither translated nor brought into the clinical setting, and remain enclosed in the peer-reviewed literature and scattered across general-scope expression profiling databases. At the same time, it has become apparent that the existing systems to integrate and correlate these data are either inadequate or non-existent. Given the multi-factorial molecular phenotype of disease, it is evident that the development of novel therapeutic and disease detection approaches should be based on studying the entire “System” simultaneously. Figure 1 gives a general overview of the fundamental difference between conventional and systems approaches. In conventional approaches, a hypothesis is put forward about something assumed to be of importance in the disease or biological condition. This hypothesis is then tested and either validated or refuted based on the outcome of this hypothesis-driven methodology. Although in principle one could investigate every hypothesis and then choose the one that appears most correct, real-world constraints such as time and financial resources do not allow for this. Hypotheses are therefore usually generated on a best-guess basis, which can introduce a substantial amount of bias and result in skewed, partial or even misleading insights. 
To avoid such scenarios, research driven by the data itself rather than by a hypothesis was proposed a long time ago, but could not be properly implemented because unbiased large-scale data, and the ability to integrate disparate data in the first place, were lacking. Additionally, a successful systems approach requires underlying prior knowledge, such as the physicochemical parameters governing how molecules interact with each other, the reactions they are involved in, and other otherwise unconnected information. This knowledge has accumulated only slowly through conventional research, and only over the last 10–15 years has it been available to an extent that makes a systems approach feasible. Data-driven, systems biology-based diagnostic and prognostic models consisting of relevant panels of molecules (key branches of the cellular network) appear to reflect pathophysiology more accurately than traditional hypothesis-driven approaches and, consequently, may have a much higher chance of success and implementation in the clinical setting. One of the most pronounced effects is the crossing of research borders and the push for multidisciplinary integration of biology, chemistry, computing sciences, mathematics and medicine to tackle the complexity of such systems. To get a holistic view of a system’s biology, multiple and different types of observations must be combined: clinical observations, which include pathological, demographic and epidemiological data, as well as molecular observations, which include large-scale genotyping, gene expression, proteomics, metabolomics and lipidomics data. The downside of such an approach in disease analytics or data integration is the rise in complexity, both in the outputs and in the methods needed to generate them, and in the skills required to interpret and contextualise outcome parameters. 
However, biological and disease models generated this way allow for a higher confidence in generating testable hypotheses, disease classifications on a molecular level and identification of overlapping and divergent pathways of malignant conditions. Ultimately, the removal of bias and integration of all available data, both clinical and biological, leads to a far better understanding of disease and enables the identification of intervention points with higher confidence and accuracy.

Figure 1.

Overview of general differences between conventional and Systems Biology approaches in biological and disease analysis research. Red arrows show the path of the conventional hypothesis-driven methodology including testing of a hypothesis, usually employing lab-based investigations, and re-adjusting the hypothesis dependent on outcomes. Blue arrows denote a systems approach, where data are integrated and analysed, producing a model system and a hypothesis that can be verified using conventional methods. Outputs of such an approach are usually fed back into the model or the data analysis stream to refine models, adjust hypothesis or confirm the established model.

2. Disease classification boundaries

The standard resource for disease taxonomy is the International Classification of Diseases (ICD) of the World Health Organisation, which provides information on diseases and health conditions and supports continuous monitoring of the associated epidemiological statistical trends [1]. The ICD disease classification is founded mainly on a type of evidence-based medicine that distinguishes clinical features, including patient symptoms, histological assessment and evaluation of risk factors [2]. While widely used in the clinical setting, in the era of “big data” and precision medicine its rigid hierarchical structure lacks the flexibility needed to accommodate the fast-expanding molecular insights into disease phenotypes captured across many omics platforms [3]. Moreover, in support of this notion of ill-defined disease boundaries in current disease classification, we can observe co-occurring conditions that, if seen as a unified biological network, could provide information about common multi-functional genes and cellular pathways, as well as the impact of lifestyle [4]. Additionally, analysing disease progression in the presence of overlapping conditions, through evaluation of temporal correlations and disease progression patterns condensed from a population, can become useful for predicting and preventing future disease-associated events at the level of the individual patient [5].
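The idea that co-occurring conditions may share genes can be illustrated with a small sketch. It scores disease pairs by the Jaccard overlap of their associated gene sets; this measure is our own simple stand-in rather than the method of the cited studies, and all disease and gene labels are hypothetical placeholders.

```python
from itertools import combinations

# Hypothetical gene-disease associations; labels are placeholders, not curated data.
disease_genes = {
    "disease_A": {"GENE1", "GENE2", "GENE3", "GENE4"},
    "disease_B": {"GENE3", "GENE4", "GENE5"},
    "disease_C": {"GENE6"},
}


def jaccard(a, b):
    """Return |a & b| / |a | b|, or 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def disease_similarity(associations):
    """Score every disease pair by the overlap of their associated genes."""
    scores = {}
    for d1, d2 in combinations(sorted(associations), 2):
        scores[(d1, d2)] = jaccard(associations[d1], associations[d2])
    return scores


similarity = disease_similarity(disease_genes)
# Genes common to two conditions are candidates for shared mechanisms.
shared = disease_genes["disease_A"] & disease_genes["disease_B"]
```

Pairs with a high score would be connected in a disease-disease network, and the shared genes give a starting point for looking at common pathways.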

Further refinement of disease taxonomy can be achieved by network analysis [3] that combines disease phenotypes sourced from ICD-9 with protein-protein interaction (PPI) data from STRING [6] and additionally curated gene-disease associations (GDA) from several data sources. This network analysis allowed pancreatic cancer to be reclassified into 11 subclasses, which is consistent with the number of molecular subtypes observed in the Bailey et al. [7] study. The authors of the network analysis also proposed the use of such an approach in drug repurposing, for instance therapy with metformin, a well-known agent used to treat type 2 diabetes mellitus (T2DM), which could regulate the imbalanced microbiota community in the gut mucosa, a known cause of the pathological chronic bowel inflammation that occurs in Crohn’s disease and ulcerative colitis [8], and could also act as a preventive agent to reduce the risk of colorectal cancer. Moreover, molecular profiling combined with histological assessment seems to yield improved probabilistic scores in graft survival predictions. For instance, joint integration of multi-centre histology features from renal biopsies and gene-array data yielded a new molecular scoring system able to predict renal graft survival [9], and a similar approach improved the diagnosis of antibody-mediated rejection in transplanted hearts [10]. Such approaches can also be implemented to assess disease trajectory, treatment selection and monitoring in many neoplasms, and could be specially tailored to cases where the primary tumour site is unknown [11].

3. Systems biology towards systems medicine

Over the last 15 years, the rise of systems biology as a research field has changed how we look at normal human physiological function and has helped to uncover disease complexity. Scientists now use systems biology approaches to understand the big picture of how all the pieces interact in an organism. The inference of genotype-phenotype relationships, boosted by the assembly of a high-quality human genome, opened the avenue for the development of reference maps of interactome networks [12] consisting of binary association pairs, for instance PPIs, protein-DNA/RNA or protein-metabolite interactions. Figure 2 shows the essential biological molecular interactions governing cell behaviour in an over-simplified biological system. A curated compilation of high-quality sources of binary interactions is considered a prime resource in the Systems Biology field, enabling a deeper understanding of the larger picture, be it at the level of the organism, organ, tissue or cell, by putting its components together. This is in stark contrast to decades of reductionist biology, which focuses merely on the properties of individual components [13]. Most disease conditions, such as obesity, metabolic syndrome, autoimmune diseases and renal diseases, exhibit complex disease phenotypes [13].
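As a minimal sketch of how binary association pairs can be assembled into an interactome, the following builds an undirected network from typed pairs and counts the interaction partners of each node. The entity names, the `build_network` and `degree` helpers, and the interaction list are hypothetical and serve only as illustration.

```python
from collections import defaultdict

# Hypothetical binary interactions: (entity_a, entity_b, interaction_type).
interactions = [
    ("protein_A", "protein_B", "protein-protein"),
    ("protein_A", "protein_C", "protein-protein"),
    ("protein_A", "gene_X", "protein-DNA"),
    ("protein_B", "metabolite_M", "protein-metabolite"),
]


def build_network(pairs):
    """Build an undirected adjacency map from binary interaction pairs."""
    network = defaultdict(set)
    for a, b, kind in pairs:
        network[a].add((b, kind))
        network[b].add((a, kind))
    return network


def degree(network):
    """Number of distinct interaction partners per node."""
    return {
        node: len({partner for partner, _ in edges})
        for node, edges in network.items()
    }


network = build_network(interactions)
degrees = degree(network)
# The most connected node is a simple first indicator of a network hub.
hub = max(degrees, key=degrees.get)
```

Keeping the interaction type on each edge allows protein-protein, protein-DNA and protein-metabolite layers to be analysed together or filtered apart later.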

Figure 2.

Description of the essential known relationships/interactions in an over-simplified biological system. Transcription factor (TF), microRNA (miRNA), post-translational modifications (PTMs). The illustration does not account for epigenetic modifications, for instance DNA methylation and histone modifications known to occur and regulate gene expression. Dark coloured arrows denote entity associations, while self-circular arrows describe self-pair interactions or modifications.

In the words of Ronald Germain on defining Systems Biology: “There are an endless number of definitions, it’s even worse than the elephant,” referring to that infamous elephant that stymies the attempts of blind men to describe it because each feels just one part. “Some people think of it as bioinformatics, taking an enormous amount of information and processing it. The other school of thought thinks of it as computational biology, computing on how the systems work. You need both parts.” Ironically, to best understand this novel approach, we should take a reductionist approach to defining its parts. The system, it seems, is more than the sum of its parts [14]. Systems Biology requires comprehensive data at all molecular levels, a profound understanding of biological systems, criteria-based data assessment and an in-depth understanding of the limitations of the techniques used in the experimental setup. Moreover, systems biology requires prior knowledge, either published or sourced from biological databases, and newly predicted and frequent molecular events require further in vivo/in vitro validation [15]. Systems Biology is cross-disciplinary: “[…] a scientific approach that combines the principles of engineering, mathematics, physics, and computer science with extensive experimental data to develop a quantitative as well as a deep conceptual understanding of biological phenomena, permitting prediction and accurate simulation of complex (emergent) biological behaviours” (Ronald Germain in [14]). Furthermore, systems biology promotes understanding of the functional roles and interplay of all molecules in cells in health and disease. It also provides a framework for large-scale data-driven analysis and predictions based on prior knowledge of experimentally identified interactions and pathways [16]. 
Thus, more relevant than the underlying high-throughput screening methods themselves, including genomics, proteomics, metabolomics and bioinformatics approaches, is the use of such methods in an integrative manner to understand holistically how nonlinear processes and their outcomes are regulated in a biological system [17].

3.1 Bridging the gap between fields

Over the last 10 years, major efforts to reclassify diseases based on molecular insights from advances in molecular biology, bioinformatics and high-throughput screening have yielded novel subtypes for many disease conditions. Multiple data types, including clinical endpoints, omics data and ontology-based data, have been used to reconstitute disease phenotypes and to classify and refine disease relationships [18]. Nevertheless, a molecular-based disease taxonomy that links global molecular networks with pathological phenotype landscapes remains elusive. Systems medicine can be perceived as a multi-disciplinary collaborative effort driven by the application of systems biology approaches, which include methodological workflows from high-throughput omics technologies to generate data, warehousing management systems for data flow and handling, and methods for data analytics and interpretation in the context of biomedical research [19]. Ultimately, with further adoption of a systems-based approach, patients are expected to benefit from a measurable improvement in their health status, since the processes of disease onset and progression will be identified mechanistically, leading to new insights into disease-disease boundaries and to disease subtyping that facilitates suitable pharmacological interventions such as drug repurposing [20]. One example is the identification of digoxin, a drug used to treat atrial fibrillation and congestive heart failure [21], as a potential drug candidate for pharmacological intervention in medulloblastoma subtypes 3 and 4 [22]. The authors of that study implemented an integrative systems biology approach using genomic data and collating existing drug-drug and drug-target interaction information into a three-dimensional functional drug network. 
This approach involved handling omic data sets such as DNA-seq (mutated genes), copy-number variation (CNV; repeated sections of the genome), RNA-seq and methylation profiles, combined with clinical measurements of patient outcomes (survival data), and fusing them using network-based and probabilistic methods, which yielded a composite network with disrupted driver signalling networks and potential drug candidates [22].

4. Large-scale data: omics platforms

The advent of new high-throughput technologies (sequencing, array-based and mass spectrometry) led to an explosion of available data, not only in the number of experiments performed but also in the data density obtained per experiment. Here, we describe detection platforms that handle molecular datasets; for medical imaging data types and analysis strategies, please see the review in [23].

4.1 DNA microarrays and next-generation sequencing (NGS)

Microarray technologies have been widely used in research for primary screening, including gene expression profiling and establishing genotype-phenotype relationships. Moreover, if properly designed, microarrays not only provide information on gene expression and expressed single nucleotide polymorphisms (SNPs), but also detect exon junctions and fusion genes [24]. However, as with PCR-based techniques, probe design requires prior knowledge. Therefore, microarrays are mostly applied to the quantification of known sequences and not to the discovery of new variants, transcripts or other unknown features [25]. Microarrays have several further limitations. For instance, they render an indirect measurement of the relative concentration of a particular nucleic acid sequence [26]. In addition, a DNA array can only detect the sequences it was designed for, so non-coding RNAs that are not yet recognised as expressed are typically not represented on an array [26]. Microarrays are still considered a reliable technique for routine and/or initial screening, allowing multiplex quantitation of microRNA and gene probe expression in a fast, simple and affordable way. Nevertheless, given the continuous drop in the cost of NGS to a level that virtually matches that of DNA microarray-based platforms, it is foreseen that DNA arrays will be fully replaced by sequencing methods within the next decade [26].

4.2 Proteomics

Omics technologies, including quantitative proteomics methods, aim to identify and quantify the dynamics of protein abundance in order to gain a deeper understanding of the associated biological functions. Quantifying the expression level and state of all proteins at a given time can thereby characterise physiological states at the cellular level [27]. Mass spectrometry (MS) technology, particularly tandem mass spectrometry (MS/MS), has been utilised as a discovery engine in proteomics [28]. This technology allows the simultaneous identification and quantification of hundreds or even thousands of proteins in one experimental setup, which enables real-time comparisons, for instance between two or more physiological states [29]. However, peptide sequence composition directly affects ionisation efficiency, so peptide intensities observed in a spectrum often do not reflect their abundances [30]; many label-free and label-based quantitation methods have therefore arisen to allow comparative proteomic analysis. For instance, label-free approaches such as ion intensity and spectral counting have a simpler workflow than labelling techniques and no theoretical limit on multiplexing capability, and they provide improved proteome coverage, but their quantification accuracy is lower than that of labelling methods (e.g. iTRAQ: isobaric tags for relative and absolute quantitation; SILAC: stable isotope labelling by/with amino acids in cell culture) [30]. In proteomics, several algorithms have been developed to query and cross-compare MS data. The most popular ones used to identify proteins from raw MS data include MASCOT, SEQUEST, OMSSA, X!Tandem [31], Andromeda [32], MS-GF [33], Paragon [34] and, more recently, Morpheus [35] and an improved SEQUEST-like algorithm, ProLuCID [36]. 
The rise in the number of algorithms and specialised computational tools for the analysis of MS-based proteomics data sets led to the development of workflows/pipelines such as PEAKS [37], MaxQuant [38], The OpenMS Proteomics Pipeline (TOPP) [39] and the Trans-Proteomic Pipeline (TPP) [40], and of others for further downstream data analysis, such as Perseus [41].
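To make the spectral counting idea more concrete, the sketch below computes the normalised spectral abundance factor (NSAF), one commonly used normalisation in which spectral counts are divided by protein length and then scaled to sum to one within a run. NSAF is not named in the text above and is used here only as an example; the protein names, counts and lengths are hypothetical.

```python
def nsaf(spectral_counts, lengths):
    """Normalised spectral abundance factor per protein.

    spectral_counts: protein -> number of matched spectra
    lengths: protein -> sequence length in amino acids
    """
    # Longer proteins yield more peptides, so counts are length-corrected first.
    saf = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
    total = sum(saf.values())
    if total == 0:
        raise ValueError("no spectra were assigned to any protein")
    return {p: value / total for p, value in saf.items()}


# Hypothetical counts and lengths for three proteins in two states.
lengths = {"prot_1": 100, "prot_2": 200, "prot_3": 400}
control = nsaf({"prot_1": 10, "prot_2": 20, "prot_3": 40}, lengths)
treated = nsaf({"prot_1": 30, "prot_2": 20, "prot_3": 40}, lengths)

# Relative change between states; values above 1 indicate higher abundance.
ratio = {p: treated[p] / control[p] for p in lengths}
```

Because the values are relative within each run, an increase in one protein lowers the share of the others, which is one reason label-free estimates are usually treated as less accurate than label-based ones.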

4.3 Metabolomics

In many metabolomics studies, the identification and quantification of metabolites rely mainly on analytical methods based on mass spectrometry (MS), coupled with either a liquid or a gas chromatograph, and on nuclear magnetic resonance (NMR) spectroscopy [42]. Metabolites are defined as small molecules, usually of less than 1000 Da, which undergo several changes during cellular metabolism [43]. The selection of a particular platform depends on the aims of the experimental study and is typically driven by a compromise among sensitivity, specificity and scanning speed [44]. Metabolomics approaches can be broadly split into untargeted metabolomics, the full-range measurement and analysis of all compounds in a given sample, and targeted metabolomics, in which a set of predefined and biochemically well-characterised compounds is measured in a sample [44]. MS has become an essential method for untargeted profiling of metabolites in complex bio-samples, particularly low-abundance metabolites, owing to its high sensitivity and selectivity when liquid chromatography (LC) is coupled to tandem MS (MS/MS) [45]. Metabolomics data from NMR and MS platforms are complex because they usually contain thousands of distinct peaks; multivariate statistical analysis therefore plays an important role in metabolomics for reducing data dimensions, differentiating similar spectra and developing predictive models [46]. Metabolomics is used as a screening tool in current healthcare settings, and could be used far more widely to monitor therapy efficacy and assess potential drug side effects [47].
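Dimension reduction of a peak table can be sketched with principal component analysis (PCA), one of the multivariate methods commonly used for this purpose. The example below is dependency-free and finds only the first component by power iteration, which is an illustrative simplification; in practice a library routine would normally be used. The intensity values are hypothetical.

```python
import math


def centre(rows):
    """Subtract the per-feature mean from every sample."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of already centred data."""
    n, p = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(p)]
            for i in range(p)]


def first_component(cov, iterations=200):
    """Leading eigenvector of a covariance matrix by power iteration.

    The uniform start vector is an assumption; it fails only if it happens
    to be orthogonal to the leading eigenvector.
    """
    p = len(cov)
    vec = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    return vec


# Hypothetical intensity table: 4 samples x 3 peaks.
peaks = [
    [1.0, 2.0, 0.5],
    [2.0, 4.1, 0.4],
    [3.0, 5.9, 0.6],
    [4.0, 8.0, 0.5],
]
centred = centre(peaks)
pc1 = first_component(covariance(centred))
# One score per sample replaces the three original peak intensities.
scores = [sum(x * w for x, w in zip(row, pc1)) for row in centred]
```

Here the first two peaks vary together and dominate the component, while the nearly constant third peak contributes little, so a single score per sample captures most of the variation.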

5. Data-driven approaches and multi-omics data integration

In biomedical research, adopting an unbiased or “hypothesis-free” approach (also described, depending on the author and field of study, as a hypothesis-generating approach, data-driven research or discovery research) can bring several benefits compared with the widely used traditional approach, hypothesis-driven research. The latter in some cases encourages poor scientific practice by imposing qualitative and weak hypotheses that are not suited to strong statistical inference or quantitative analysis (QA) modelling; in such cases an explicitly exploratory approach should be the default [48]. To overcome this problem, large-scale approaches such as expression profiling became very popular from the mid-1990s, and other methods followed from around 2000 with the advent, rapid development and availability of high-throughput mass spectrometry [49]. Computational methods to analyse this flood of data were developed accordingly; however, most focused on one specific technology or experimental setup, and to this day they are very often not transferable to other technological platforms. Large-scale approaches employed in omics research need a different analysis methodology, which is especially true if integrative analysis techniques are employed. Truly integrative approaches (as opposed to integrating linear relationship data such as gene-protein data) go beyond simple data fusion and gave rise to the field of Systems Biology. Hypothesis-generating research (systems biology-derived hypotheses) and hypothesis-driven research are nevertheless complementary, so combining both offers a better chance of a complete understanding of complex biological systems than either approach on its own [48]. With the advent of high-throughput technologies, their application in the biomedical field was a logical and foreseeable step. 
However, until recently the integration of multi-omic data was not a common part of analysis workflows. The literature and publicly available databases are awash with data, yet the integration of all this information in a disease-specific context has traditionally been based on meta-analysis at best, or could not be accomplished at all using standard computational methods. Molecular information can be integrated at a later stage by means of meta-analysis or by cross-normalisation of data from different acquisition platforms [50]. A combinatorial stepwise data integration approach (Figure 3) can be used to incorporate data from different biological layers of information in order to predict phenotypic outcomes [51]. Alternatively, the cell environment and its dynamics can be recreated by describing the interactions of its components qualitatively and quantitatively, relying on underlying data (prior biological knowledge) for connectivity, e.g. PPIs, molecular co-occurrence, ontologies and enzymatic reactions [52]. Large-scale data sets, for instance those derived from multi-omics platforms, may also be used to infer novel relationships through network learning approaches using Bayesian inference models [51] and by extracting molecular information from multi-layered networks. This approach (like many others) is challenging, since it requires sufficient statistical power, and hence a large number of samples, to deduce all the possible interactions. Another challenge is the lack of uniform “gold standards” (evaluation criteria) for accepting or rejecting relationships in the inferred model; however, the ability to recreate a well-accepted interaction can at least be used to benchmark methods in biological systems [53].
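One simple way to picture cross-normalisation across acquisition platforms is to standardise every feature within its own layer and then concatenate the layers for the samples they share. The sketch below uses per-feature z-scores, which is our own illustrative choice and not necessarily the method of the cited work; the layer names, sample identifiers and values are hypothetical.

```python
from statistics import mean, stdev


def zscore_features(table):
    """Standardise each feature (column) of a samples-by-features table.

    table: sample -> {feature: value}; every sample must carry the same features.
    """
    features = sorted(next(iter(table.values())))
    scaled = {sample: {} for sample in table}
    for f in features:
        values = [table[s][f] for s in table]
        mu, sd = mean(values), stdev(values)
        for s in table:
            scaled[s][f] = (table[s][f] - mu) / sd if sd > 0 else 0.0
    return scaled


def integrate(layers):
    """Concatenate standardised layers for the samples present in all of them."""
    shared = set.intersection(*(set(t) for t in layers.values()))
    merged = {s: {} for s in sorted(shared)}
    for name, table in layers.items():
        scaled = zscore_features({s: table[s] for s in shared})
        for s in shared:
            for f, v in scaled[s].items():
                # Prefix each feature with its layer to avoid name clashes.
                merged[s][f"{name}:{f}"] = v
    return merged


# Hypothetical measurements on very different scales.
layers = {
    "transcript": {"s1": {"g1": 100.0}, "s2": {"g1": 200.0}, "s3": {"g1": 300.0}},
    "protein": {"s1": {"p1": 0.1}, "s2": {"p1": 0.2}, "s3": {"p1": 0.3},
                "s4": {"p1": 0.9}},
}
merged = integrate(layers)
```

After standardisation both layers are on a comparable scale, so the merged table can be passed to a downstream model without one platform dominating purely because of its units. Sample `s4` is dropped because it was not measured in every layer.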

Figure 3.

Proposed workflow for a data-driven approach. Data generation from omics platforms plus existing biological information (a), development of a multi-omics database (b), selection of suitable modelling methods (c), model validation and use for hypothesis-generating research (d), lead optimization and candidate selection (e).

6. Biological databases and database systems

Databases form the basis for most applications in bioinformatics. The number of biological databases now available is enormous: the journal Nucleic Acids Research (NAR) catalogues a total of 1737 molecular biology databases (2018 edition) [54]. The 2018 edition contains 181 papers, which describe 82 new biological databases, 84 updates and 15 databases previously published elsewhere. However, a prominent issue is that many databases are not maintained over time and are abandoned, yet they persist in database listings. There are many different types of databases, ranging from primary databases, which contain sequence data such as nucleic acid or protein sequences, to secondary databases, also known as pattern databases, which host the results of analysing the sequences held in primary databases.

6.1 General scope expression databases

The Gene Expression Omnibus (GEO) [55] is a public repository that functions both as a warehouse of raw microarray and other gene-based high-throughput data and as a platform for gene differential expression (DE) analysis, using the GEO2R tool, across a multitude of experimental conditions in user-submitted pre-processed data sets. Its European counterpart for storing high-throughput genomics data is the ArrayExpress database [56], hosted by the European Bioinformatics Institute (EMBL-EBI). Both resources comply with community guidelines for describing the experimental setup of microarray and high-throughput NGS experiments. Comparatively, there is currently much less support for sharing proteomics and metabolomics data sets, despite increasing demand. Public efforts towards proteomic data sharing yielded the Proteomics Identification Database (PRIDE), which contains over 10,100 user-submitted MS-based raw proteomic data sets (September 2018) [57]. PeptideAtlas [58] re-analyses data sets via the TPP pipeline to give end-users a consistent view of their data. MetaboLights [59] hosts user-submitted metabolomics experiments and currently houses 439 experiments (November 2018). The standards for reporting proteomics and metabolomics experiments are coordinated by the Human Proteome Organisation’s Proteomics Standards Initiative (HUPO-PSI) and the Metabolomics Standards Initiative (MSI), respectively.
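The basic idea behind a two-group DE analysis can be sketched as follows. GEO2R itself relies on the GEOquery and limma R packages; the example below is a much simpler stand-in that computes a per-gene log2 fold change and Welch's t statistic, without p-values or multiple-testing correction. Gene names, sample names and intensities are hypothetical, and the values are assumed to be already log2-transformed.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for two independent groups (unequal variances)."""
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    return (mean(group_b) - mean(group_a)) / se if se > 0 else 0.0


def differential_expression(expression, control, case):
    """Per-gene log2 fold change and Welch t statistic, ranked by |t|.

    expression: gene -> {sample: log2 intensity}
    """
    results = []
    for gene, values in expression.items():
        a = [values[s] for s in control]
        b = [values[s] for s in case]
        # On log2 data the fold change is a difference of group means.
        results.append((gene, mean(b) - mean(a), welch_t(a, b)))
    return sorted(results, key=lambda r: abs(r[2]), reverse=True)


# Hypothetical log2 intensities for two genes in three controls and three cases.
expression = {
    "gene_flat": {"c1": 6.0, "c2": 6.1, "c3": 5.9, "d1": 6.0, "d2": 5.9, "d3": 6.1},
    "gene_up": {"c1": 5.0, "c2": 5.2, "c3": 4.8, "d1": 7.0, "d2": 7.1, "d3": 6.9},
}
ranked = differential_expression(expression, ["c1", "c2", "c3"], ["d1", "d2", "d3"])
```

A log2 fold change of 2 corresponds to a four-fold increase in the case group. A real analysis would add moderated statistics and false discovery rate control, which is what dedicated packages provide.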

6.2 Disease profiling databases

Our group has developed more specialised database resources for several disease conditions, handling pre-selected data sets containing DE molecules. In nephrology, we developed the Chronic Kidney Disease database (CKDdb) [60], which stores microRNA, genomics, peptidomics, proteomics and metabolomics information relevant to CKD, collected from over 300 studies in the literature and integrated into the Pan-omics Analysis DataBase (PADB). The PADB framework (www.padb.org) uses gene and protein clusters (CluSO) and mapping of orthologous genes (OMAP) between species, thereby facilitating data harmonisation from a diverse range of omics platforms and across several species, which makes it an invaluable resource for data-driven systems biology approaches. Many conditions associated with the cardiovascular system are covered in the Cardio/Vascular Disease (C/VD) database [61], which places special emphasis on coronary artery disease (CAD). For neurological conditions such as multiple sclerosis, we developed the MuScle database [62], which stores and integrates curated data sets mined from large-scale studies with a focus on genomics and miRNA. Likewise, we built a cancer-related differential expression database, the Multi-Omics Cancer database (MoCadb), which integrates clustered molecular information from multi-omics studies in many gastro-intestinal cancers. Within the same framework we also provide an assorted disease profiling database that is valuable for subtractive disease analysis studies, the Large-Scale Screening Resource (LSSR), which contains 81,980 entries referring to 13,589 molecules. Finally, there is a peak profiling database for research on biomarker patterns, the Urinary Peptidomics and Peak-maps database (UPdb) [63], which comprises human urinary fingerprints from 200 subjects analysed mainly by surface-enhanced laser desorption/ionisation time-of-flight mass spectrometry (SELDI-TOF-MS).

7. Software tools and solutions

Many modern high-throughput technologies generate exceptionally large and complex datasets, which include PPIs, protein-DNA interactions, kinase-substrate interactions, qualitative and quantitative genetic interactions and gene co-expression [64]. The “Big Data” challenge can be met by developing bioinformatics tools that handle these large datasets and reduce their complexity to a level that enables rational interpretation, and that is therefore more likely to provide new biological insights for the life sciences. A non-exhaustive compilation of web-based tools, standalone tools and R-based packages is given in Table 1. They support different omics tasks such as feature selection, sample classification and multivariate analysis. Cytoscape [65] is a tool primarily designed for network visualisation and analysis. Its functionality is extended by a wide variety of freely available apps/plugins designed by the scientific community and distributed through the hosting website (over 300 apps available as of November 2018), covering a diverse array of uses and analysis types.

Name | Description | Webpage | Ref.
iClusterPlus | Integrative clustering | bioconductor.org/packages/iClusterPlus | [84]
mixOmics | Data integration (CCA, PLS, PCA) | mixomics.org | [85]
omicade4 | MCIA and CIA | bioconductor.org/packages/omicade4 | [86]
pwOmics | Pathway-based integration of omics | bioconductor.org/packages/pwOmics | [87]
PRESTO | Dimensionality reduction of multivariate data | github.com/saramcardle/PRESTO | [88]
caret | Classification and regression training | cran.r-project.org/web/packages/caret |
GEO2R | Identify DE genes using GEOquery & limma R packages | ncbi.nlm.nih.gov/geo/geo2r | [55]
MetaboAnalyst | Metabolomics analysis | metaboanalyst.ca | [89]
NetworkAnalyst/INMEX | Integration of gene DE via network approaches | networkanalyst.ca | [90]
ExAtlas | Meta-analysis & visualisation of gene DE | lgsun.irp.nia.nih.gov/exatlas | [91]
Elastic net | Gene DE with fitted GLM | https://zenodo.org/record/16006 | [92]
ATHENA | Integration of genomics with clinical data | ritchielab.org/software/athena-downloads | [93]
Network propagation | Gene DE, mutations, PPIs | http://apps.cytoscape.org/apps/Diffusion | [94]

Table 1.

Web-based tools, standalone tools and R packages dedicated to different omics tasks such as feature selection, sample classification, and multivariate approaches to data integration and meta-analysis.

PMA, Penalised Multivariate Analysis; RGCCA, Regularised and Sparse Generalised Canonical Correlation Analysis for Multiblock Data; caret, Classification and REgression Training; ATHENA, Analysis Tool for Heritable and Environmental Network Associations; CCA, Canonical-Correlation Analysis; PLS, Partial Least Squares; PCA, Principal Component Analysis; CIA, Co-Inertia Analysis; MCIA, Multiple Co-Inertia Analysis; GO, gene ontology; DE, differential expression; GLM, generalised linear models.

7.1 Gene ontology (GO) and pathway-term-enrichment

The Gene Ontology (GO) consortium [66] aims to capture the increasing knowledge on gene function in a controlled vocabulary applicable to a wide range of organisms. GO describes the attributes of genes and gene products in terms of their associated biological processes (BP), cellular components (CC) and molecular functions (MF). GO is loosely hierarchical: ‘child’ terms are more specific than their ‘parent’ terms, but a ‘child’ term may have more than one parent. The ClueGO app [67] is used for the integration and visualisation of GO and pathway terms sourced from KEGG [68], WikiPathways [69] and Reactome [70]. The resulting ClueGO network is built on kappa statistics, which quantify the agreement between terms based on the genes and/or gene products they share. The ClueGO output depends on the threshold set for the kappa coefficient: a higher threshold restricts the visualisation to closely related terms with nearly identical gene products, whereas a lower threshold also displays less closely associated terms.
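
The kappa thresholding described above can be illustrated with a minimal sketch. It assumes each term is represented by the set of genes annotated to it and computes Cohen's kappa over a shared gene universe; the function names, the toy gene sets and the handling of the degenerate case are illustrative assumptions, not ClueGO's actual implementation.

```python
def term_kappa(genes_a, genes_b, universe):
    """Cohen's kappa between two terms, each given as a set of annotated genes."""
    universe = set(universe)
    if not universe:
        raise ValueError("universe must not be empty")
    a = set(genes_a) & universe
    b = set(genes_b) & universe
    n = len(universe)
    both = len(a & b)                       # genes annotated to both terms
    neither = n - len(a | b)                # genes annotated to neither term
    observed = (both + neither) / n         # observed agreement
    # agreement expected by chance from the marginal annotation frequencies
    expected = (len(a) * len(b) + (n - len(a)) * (n - len(b))) / n**2
    if expected == 1.0:
        # degenerate case (both terms cover all genes or none);
        # treated as full agreement, a convention of this sketch only
        return 1.0
    return (observed - expected) / (1 - expected)


def linked_terms(term_to_genes, universe, threshold):
    """Return (term, term, kappa) for every pair at or above the threshold."""
    names = sorted(term_to_genes)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            kappa = term_kappa(term_to_genes[first], term_to_genes[second], universe)
            if kappa >= threshold:
                pairs.append((first, second, round(kappa, 3)))
    return pairs


# Hypothetical toy data: ten genes and three terms.
universe = {f"g{i}" for i in range(1, 11)}
terms = {
    "term_A": {"g1", "g2", "g3", "g4"},
    "term_B": {"g1", "g2", "g3", "g5"},
    "term_C": {"g8", "g9"},
}
print(linked_terms(terms, universe, threshold=0.4))  # lenient threshold
print(linked_terms(terms, universe, threshold=0.7))  # strict threshold
```

With these toy sets, only term_A and term_B are linked at the lenient threshold, and no pair survives the strict one, which mirrors the effect of raising the kappa threshold on the displayed network.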

7.2 Gene-disease associations (GDA)

The completion of the Human Genome Project led to a surge in research aimed at uncovering associations between genotype and disease phenotype [71]. This resulted in a disproportionate growth in the number of publications, while the biocuration of the newly discovered evidence remained limited and slow. Currently, DisGeNET [72] unifies GDA evidence from the biomedical literature, collated from a multitude of databases. The database uses the Medical Subject Headings (MeSH) tree structure, within the Unified Medical Language System, for disease classification. Its utility is extended by the disgenet2r package and optional programmatic access.

7.3 Protein-protein interactions (PPIs)

The STRING database [6] collates molecular information covering both known and predicted PPIs. The interaction data derive from primary interaction databases such as IntAct [73] and BioGRID [74], supplemented by text-mining, co-expression, high-throughput experiments and computationally predicted PPIs. Version 10.5, the most recent at the time of writing, comprises nearly 26 million PPIs with a confidence score greater than 0.9, involving more than 9 million proteins across 2031 organisms. GeneMANIA is another resource for PPI analysis, accessible via a web interface [75] and as a Cytoscape app. It detects genes related to an input query by means of a “guilt-by-association” strategy, which rests on the idea that the function of a protein can be inferred from the known functions of the proteins it interacts with. The app uses a large database of functional interaction networks, indexing 2152 association networks containing more than 500 million interactions mapped to 166,084 genes from nine organisms.
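
As a rough illustration of the guilt-by-association idea, the sketch below keeps only high-confidence interactions and lets the annotated partners of a query protein vote on its likely function. The edge list, protein names, function labels and the score-weighted voting rule are hypothetical; this is not how STRING or GeneMANIA compute their results.

```python
from collections import Counter


def predict_function(query, edges, annotations, min_score=0.9):
    """Infer a function label for `query` from its annotated interaction partners.

    edges:       iterable of (protein_a, protein_b, confidence_score)
    annotations: dict mapping protein -> known function label
    Only interactions with a score greater than `min_score` are used.
    """
    votes = Counter()
    for protein_a, protein_b, score in edges:
        if score <= min_score:
            continue  # discard low-confidence interactions
        if protein_a == query:
            partner = protein_b
        elif protein_b == query:
            partner = protein_a
        else:
            continue  # edge does not involve the query protein
        label = annotations.get(partner)
        if label is not None:
            votes[label] += score  # weight each vote by interaction confidence
    if not votes:
        return None  # no annotated high-confidence partner
    return votes.most_common(1)[0][0]


# Hypothetical toy network; P1 is the unannotated query protein.
edges = [
    ("P1", "P2", 0.95),
    ("P1", "P3", 0.92),
    ("P4", "P1", 0.97),
    ("P1", "P5", 0.40),  # below the confidence threshold, ignored
    ("P2", "P3", 0.99),  # does not involve the query, ignored
]
annotations = {
    "P2": "kinase signalling",
    "P3": "kinase signalling",
    "P4": "transport",
    "P5": "transport",
}
print(predict_function("P1", edges, annotations))
```

Raising `min_score` trades coverage for reliability: fewer partners contribute, but each contributing interaction is better supported.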

7.4 Combining metabolomic and gene expression data

Multi-omics datasets may contain not only protein and gene data but also abundance profiles of chemical compounds. While it is straightforward to combine protein/DNA/RNA expression data using common identifiers, this is not the case for metabolites, the end-products of metabolism. Linking them requires a guilt-by-association approach, based on the rationale that metabolites are frequently produced by enzymes, so that a shift in metabolite abundance can reflect an upstream shift in protein or gene expression. This involves semantic searches in enzyme repositories such as BRENDA to identify candidate proteins, and has inherent pitfalls such as uncertainty over which enzyme or isoform is responsible for the metabolic change. Additionally, the same compound can be generated by several proteins, which adds to the uncertainty. Therefore, metabolomic datasets are often treated as separate entities in multi-omics studies, analysed independently and converged only at the level of final outcomes [76]. The MetScape 3 app [77] for Cytoscape can perform joint analysis of metabolomic and gene expression data and allows visualisation of the entire fused network or of custom views based on metabolic pathways. When dealing with large-scale datasets, there is the option to use a concept file based on pre-computed gene set enrichment analysis (GSEA), along with statistical and fold-change thresholds.
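
The mapping problem described above can be sketched as follows. The example assumes a pre-built lookup from metabolite to candidate producing enzymes (for instance, compiled by hand from an enzyme repository) and log2 fold changes for both data types; all identifiers are hypothetical, and the rule that a supporting gene must change in the same direction as the metabolite is a simplifying assumption of this sketch.

```python
def link_metabolites_to_genes(metabolite_changes, enzyme_map, gene_changes):
    """Link changed metabolites to candidate enzyme genes and flag ambiguity.

    metabolite_changes: dict metabolite -> log2 fold change
    enzyme_map:         dict metabolite -> list of candidate enzyme genes
    gene_changes:       dict gene -> log2 fold change
    """
    links = []
    for metabolite, met_fc in metabolite_changes.items():
        candidates = enzyme_map.get(metabolite, [])
        # a candidate is "supported" if it was measured and changes
        # in the same direction as the metabolite (assumption of this sketch)
        supported = [
            gene for gene in candidates
            if gene in gene_changes and gene_changes[gene] * met_fc > 0
        ]
        if not candidates:
            status = "unmapped"    # no enzyme known for this compound
        elif len(supported) == 1:
            status = "unique"      # exactly one plausible upstream gene
        elif supported:
            status = "ambiguous"   # several enzymes/isoforms could explain it
        else:
            status = "unsupported" # candidates exist but none agrees
        links.append({
            "metabolite": metabolite,
            "supported": supported,
            "status": status,
        })
    return links


# Hypothetical toy data.
metabolite_changes = {"met_X": 1.2, "met_Y": -0.8, "met_Z": 0.5}
enzyme_map = {"met_X": ["ENZ1", "ENZ2"], "met_Y": ["ENZ3"]}
gene_changes = {"ENZ1": 0.9, "ENZ2": 1.1, "ENZ3": -0.6}

for link in link_metabolites_to_genes(metabolite_changes, enzyme_map, gene_changes):
    print(link)
```

The "ambiguous" and "unmapped" outcomes correspond to the pitfalls noted in the text: several proteins may produce the same compound, and some compounds have no identifiable enzyme at all.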

7.5 Transcription factor (TF)-driven modules and microRNA-target regulation

Transcription factors (TFs) are critical for the regulation of gene expression since they control whether a gene’s DNA is transcribed into RNA [78]. A compendium of non-redundant TFs and TF binding sites can be found at JASPAR [79]. The number of human TFs ranges from 1500 to 2600, depending on source and stringency [78]. Direct analysis of events modulated by TFs is not only valuable in itself but may also shed light on hidden elements that conventional pathway analysis cannot reveal. However, many TF binding sites and modulated genes are highly hypothetical, and some amount to little more than guesses. Therefore, network-based analysis and interpretation involving TF elements should be treated with caution. CyTargetLinker [80] extends existing biological networks by adding interactions associated with regulatory elements, such as TF-target, miRNA-target or drug-target interactions. The application requires a loaded network whose attributes are preferably mapped to Ensembl, NCBI gene, UniProt, miRBase or DrugBank identifiers. Similarly, in CluePedia [81] users can perform miRNA analysis by matching miRNAs to target genes via a selection of database resources, including custom versions. Users can upload a list of genes and query the app to perform gene/miRNA enrichment. The app then generates a miRNA-target interaction network that can be reused for integration with GO and pathway term clustering within ClueGO [81].

7.6 Pathway mapping and visualisation

7.6.1 PathVisio pathway mapping and editing

PathVisio [82] allows drawing, editing and visualisation of pathways, handling gene, protein and metabolite data that can be cross-mapped via BridgeDb [83]. Relevant pathways are inferred from an archive of existing pathway maps from WikiPathways [69] and Reactome [70]. Pathway over-representation is assessed with a Z-score computed under the hypergeometric distribution, and pathways are ranked by a P-value obtained from a permutation procedure (randomisation test) that compares actual and permuted Z-scores. Pathways with a permuted P < 0.05 are considered significant by default.
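
The statistical procedure above can be sketched in a few lines. The Z-score below uses the mean and variance of the hypergeometric distribution, with N measured genes, R genes meeting the selection criterion, n genes in the pathway and r pathway genes meeting the criterion. The one-sided comparison, the add-one correction and all names are assumptions of this sketch and may differ from PathVisio's own implementation.

```python
import math
import random


def z_score(r, n, R, N):
    """Z-score for r hits among n pathway genes, given R hits among N genes."""
    if N < 2 or n == 0:
        return 0.0
    p = R / N                                   # overall fraction of hits
    # hypergeometric variance, including the finite-population correction
    variance = n * p * (1 - p) * (1 - (n - 1) / (N - 1))
    if variance <= 0:
        return 0.0
    return (r - n * p) / math.sqrt(variance)


def permuted_p_value(pathway, hits, universe, n_perm=1000, seed=0):
    """Return (actual Z, permuted P) for one pathway.

    Hit labels are reassigned at random across the universe, and the permuted
    P is the fraction of permutations whose Z reaches or exceeds the actual Z.
    """
    genes = sorted(universe)                    # fixed order for reproducibility
    pathway = set(pathway) & set(genes)
    hits = set(hits) & set(genes)
    N, R, n = len(genes), len(hits), len(pathway)
    actual = z_score(len(pathway & hits), n, R, N)
    rng = random.Random(seed)
    at_least = 0
    for _ in range(n_perm):
        permuted_hits = set(rng.sample(genes, R))
        if z_score(len(pathway & permuted_hits), n, R, N) >= actual:
            at_least += 1
    # add-one correction avoids reporting a P-value of exactly zero
    return actual, (at_least + 1) / (n_perm + 1)


# Hypothetical toy data: 200 measured genes, 20 of them selected as hits.
universe = [f"g{i}" for i in range(200)]
hits = universe[:20]
enriched_pathway = universe[:10]        # every pathway gene is a hit
unrelated_pathway = universe[100:110]   # no pathway gene is a hit

for name, pathway in [("enriched", enriched_pathway), ("unrelated", unrelated_pathway)]:
    z, p = permuted_p_value(pathway, hits, universe)
    print(name, round(z, 2), "significant" if p < 0.05 else "not significant")
```

Under the default cut-off quoted in the text, the first toy pathway would be reported as significant and the second would not.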

7.6.2 KEGG pathway mapping

KEGG is an integrated database resource of biological systems combining genomic, compound and functional information. KEGG allows analysis of datasets from high-throughput omics technologies by uploading a list of genes/proteins or metabolites along with optional statistical scores and fold-change values. After conversion to KEGG internal identifiers, the molecular data are mapped (KEGG Mapper) onto a collection of curated pathways covering metabolism, signal transduction, specific pathways for several disease conditions, and drug development.

8. Conclusions and future perspectives

The availability of large-scale multi-omics data has opened an avenue to unrivalled insight into disease-associated molecular pathophysiological changes. Simultaneously, it has become apparent that systems to integrate and correlate this data are either inadequate or non-existent. The literature and publicly available databases are awash with data, yet integrating all this information in a disease-specific context has traditionally relied on meta-analysis at best, or cannot be accomplished at all using standard computational methods. In order to better model complex organisms, samples from multiple tissues of the same individuals should be studied simultaneously using omics data, which will require the development of novel analysis methods. Acquiring the relevant tissues and/or body fluids from human study cohorts can of course be difficult; comparative systems biology may therefore help identify which organisms are similar enough in each aspect to be used as models. It is sometimes suggested that omics technologies and systems biology have failed to deliver many breakthrough improvements in the treatment of complex diseases. In some cases, it may be that such diseases are not truly one disease from a systems or reductionist point of view, but several diseases with the same or similar phenotypic end-points; in current terminology, they are unknown subtypes of disease. If this is the case, then the overlap between the systems is poor, and the statistical methods on which the approach relies require very large cohorts to identify these subtypes and subsequently describe each system. Other possibilities are that longitudinal data or samples from different tissues are required. Further concerns arise from biomarker validation studies, such as correlated observations (i.e. multiple observations per patient), multiplicity (testing multiple biomarkers or endpoints), multiple clinical endpoints (interest in more than one relevant endpoint) and selection bias (from retrospective data or observational studies). Although data-driven investigations using systems biology approaches offer comprehensive views of the function of biological systems in health and disease, they are limited by the completeness of prior biological information.

Acknowledgments

The research leading to these results has received funding from the European Union’s Seventh Framework Programme FP7/2007–2013 under grant agreement FP7-PEOPLE-2013-ITN-608332. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Conflict of interest

The authors declare that there is no conflict of interest regarding the publication of this manuscript.

© 2019 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite and reference

Marco Fernandes and Holger Husi (March 30th 2019). Integrative Systems Biology Resources and Approaches in Disease Analytics, Systems Biology, Dimitrios Vlachakis, IntechOpen, DOI: 10.5772/intechopen.84834. Available from:
