Computational Systems Analysis on Polycystic Ovarian Syndrome (PCOS)

Complex diseases are caused by a combination of genetic and environmental factors. Unraveling the molecular pathways through which genetic factors affect a phenotype is always difficult, and in complex diseases it is further complicated because the genetic factors may differ among affected individuals. Polycystic ovarian syndrome (PCOS) is an example of a complex disease for which molecular information is limited. Recently, PCOS molecular omics data have appeared in a growing number of publications. We conduct extensive bioinformatics analyses on these data and closely integrate experimental and computational biology to understand the complex biological system of PCOS by examining multiple interacting genes and their products. PCOS involves networks of genes, and to understand them, those networks must be mapped. This approach has emerged as a powerful tool for studying complex diseases and has been termed network biology. Network biology encompasses a wide range of network types, including those based on physical interactions between and among cellular components and those based on similarity among patients or diseases. Each of these offers distinct biological clues that may help scientists transform their cellular parts list into insights about complex diseases. This chapter discusses some aspects of the computational analyses used in the omics studies that have been conducted in PCOS.


Introduction
Most pathological conditions and diseases involve genetic components. Some diseases, such as cystic fibrosis, hemophilia, and sickle cell disease, are caused by mutations in a single gene [1][2][3]. However, many other common medical problems, such as cardiovascular diseases, diabetes mellitus, obesity, and polycystic ovarian syndrome (PCOS), are not caused by single mutations [4][5][6][7]. The etiologies of these disorders are much more complex: they are associated with multiple genes/proteins in combination with multiple factors, including genetics, environment, and lifestyle. Many efforts have been made to address the complexity of these medical problems, and studying diseases at the molecular level is one of them. Advances in biological technology have greatly improved the ability to decipher the pathobiology of diseases by generating numerous large omics (genomics, transcriptomics, proteomics, and metabolomics) datasets. These data capture a wide range of disease phenomena, including mutations, gene expression, protein expression, metabolite profiles, and genetic and physical interactions between biological molecules, and each dataset offers distinctive knowledge for understanding the diseases. A single, independent omics dataset is insufficient to explain complex diseases, since these diseases are regulated at multiple systems levels. They can instead be elucidated by integrated omics analysis (the integration of multi-omics data).
Multi-omics analysis has brought a new challenge: developing methods or pipelines, statistics, algorithms, and tools for integration, for which the assistance of computational systems analysis is greatly needed. Integrative analysis of these multiple omics data is among the best ways to derive systematic and comprehensive views of diseases, achieve a better understanding of disease mechanisms, and find actionable personalized health treatments. With the help of computational systems analysis, research in biology and biomedicine has benefited tremendously over the past few decades.
Computational systems analysis connects interdisciplinary perspectives, drawing on mathematics, algorithms, statistics, modeling and simulation, data repositories, and/or network visualization, and uses computational techniques to investigate a biological phenomenon or condition from a systems view. Currently, many studies integrate omics data and use network biology, one of the main techniques in computational systems analysis, to obtain a systems-level overview when elucidating the pathobiology of human diseases. Network biology can systematically connect all the molecules identified by omics studies as related to a disease. Other than network biology, some studies have used simulation approaches to better understand diseases. Database development is another form of computational systems analysis that serves to provide overall information about diseases. This chapter covers the computational systems analyses, namely network biology, modeling and simulation, and data repositories, that have been used to understand the pathobiology of human diseases, particularly PCOS.

Network biology in disease
Early biological experiments revealed that proteins, as the main agents of biological function, determine the phenotype of all organisms. With the advent of molecular biology, it has been assumed that proteins do not naturally function in isolation; instead, they interact with one another and with other molecules (e.g., DNA, RNA, and metabolites) to mediate metabolic, signaling, and regulatory pathways, cellular processes, and organismal systems [8]. Most biological characteristics or phenotypes arise from the complex interactions between the cell's numerous constituents [9]. Any interruption to the interactions between those molecules can disturb the normal behavior of cells and contribute to medical problems or diseases [10]. Thus, studies of network biology in disease are essential, as they can detect interrupted biological events and reveal the biological roles of molecules within cells [11].
DOI: http://dx.doi.org/10.5772/intechopen.89490
In network biology, two types of analysis are often performed to understand the pathobiology of diseases: protein-protein interaction analysis and pathway analysis.

Protein-protein interaction analysis
Proteins are biological molecules that play important roles in the molecular processes of a cell. They act as enzymes in metabolic reactions, take part in DNA replication, transport molecules, defend against antigens, and transmit information between cells [12]. Proteins physically interact with each other to perform biological functions in a cell. Protein-protein interaction (PPI) analysis has become a valuable approach for studying the molecular mechanisms of disease [13]. For example, non-metastatic and metastatic breast tumors have been classified, and markers of metastasis identified, using a network-based method. According to that study, the network-based method is more effective because it detects genes that play a role in metastasis but are not picked up by differential expression analysis [14]. Protein networks for type-1 diabetes were constructed by integrating genome-wide association study (GWAS) data with information from protein-protein interaction databases; eight new genes were subsequently identified, providing better knowledge of the mechanism of type-1 diabetes [15]. In addition, new pathways have been defined from a protein network of Huntington disease, giving a deeper understanding of its pathogenesis [16]. These studies indicate that networks, particularly protein networks, can be powerful tools for understanding the molecular basis of diseases. Thus, this method could be applied to unveil the molecular basis of PCOS.
A collection of PPIs forms a network that consists of two main components: (1) nodes, which represent proteins, and (2) edges, which represent interactions (Figure 1). PPI networks have been applied in evolutionary studies [45], gene/protein function prediction [46], and the study of disease pathobiology [47,48]. Several analyses can be carried out with the PPI network approach, and topological analysis is one that is often used to study the pathobiology of human diseases. The degree of a node, that is, the number of its interactions (which can be expressed as a fraction of the number of interactions in the network), is one of the network topology components that have been measured. A node with a high degree is known as a hub protein. A hub protein is hypothesized to be encoded by an essential gene that plays an important role in a cell. Any physical or chemical alteration of a hub protein can interrupt its interactions with other proteins, disturb the normal behavior of cells, and be associated with disease. A previous study by Wachi et al. found that proteins encoded by genes upregulated in lung squamous cell carcinoma tend to have a higher degree [49]. Jonsson and Bates also found that 346 cancer-related proteins have two times higher degree connectivity than non-cancer proteins [50]. The number of interactions among disease-related proteins in the Online Mendelian Inheritance in Man (OMIM) Morbid Map is higher than that among non-disease proteins [51]. The linkage method is another network analysis that can be used to understand the pathobiology of human diseases [50,52]. The basic hypothesis of this method is that two proteins that interact with each other (pairwise linkage) tend to be related to the same disease. Enrichment analysis by Oti et al. [53] demonstrated that proteins that interact with each other are significantly associated with the same diseases. Using a pairwise linkage method, they also predicted that Janus kinase 3 (JAK3) might be associated with severe combined immunodeficiency syndrome (SCID), as JAK3 directly interacts with lymphocyte-specific protein-tyrosine kinase (LCK), protein-tyrosine phosphatase (PTPRC), and interleukin 2 receptor (IL2RG) [53].
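The topological ideas above (node degree, hub proteins, and pairwise linkage) can be illustrated with a minimal Python sketch. It is not the implementation used in any of the cited studies. The three JAK3 interactions follow the example in the text, whereas the proteins `P1` and `P2`, the degree cutoff, and all function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (protein_a, protein_b) pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-interactions
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def node_degrees(adjacency):
    """Degree of a node = number of distinct interaction partners."""
    return {node: len(partners) for node, partners in adjacency.items()}


def hub_proteins(adjacency, min_degree):
    """Nodes whose degree reaches a user-chosen cutoff (the cutoff is an assumption)."""
    degrees = node_degrees(adjacency)
    return sorted(node for node, degree in degrees.items() if degree >= min_degree)


def linkage_candidates(adjacency, disease_proteins):
    """Pairwise linkage: for each protein not yet linked to the disease,
    list its direct partners that are already disease-associated."""
    candidates = {}
    for node, partners in adjacency.items():
        if node in disease_proteins:
            continue
        hits = partners & disease_proteins
        if hits:
            candidates[node] = sorted(hits)
    return candidates


# JAK3 edges follow the example in the text; P1 and P2 are hypothetical proteins.
edges = [
    ("JAK3", "LCK"), ("JAK3", "PTPRC"), ("JAK3", "IL2RG"),
    ("P1", "LCK"), ("P2", "P1"),
]
adjacency = build_adjacency(edges)
scid_proteins = {"LCK", "PTPRC", "IL2RG"}

print(node_degrees(adjacency))
print(hub_proteins(adjacency, min_degree=3))
print(linkage_candidates(adjacency, scid_proteins))
```

In this toy network JAK3 is the only node that reaches the cutoff, and it is also the candidate with the most disease-associated partners. In practice, the cutoff for calling a node a hub and the set of known disease proteins would come from the network and disease annotation being studied.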
Clustering is another network analysis technique used in the study of human diseases. A cluster (or module) is a small group of nodes with similar topological network properties [48,54]. This method hypothesizes that the proteins in a cluster tend to be associated with the same diseases. Clusters are identified using algorithms, and several clustering algorithms have been developed for this purpose, such as CFinder [55], clustering with overlapping neighborhood expansion (ClusterONE) [56], clustering based on maximal cliques (CMC) [57], the clique percolation method (CPM) [58], density-periphery based clustering (DPClus) [59], density-periphery overlapping-based clustering (DPClusO) [60,61], the identifying protein complex algorithm (IPCA) [62], the local clique merging algorithm (LCMA) [63], restricted neighborhood search clustering (RNSC) [64], Markov clustering (MCL) [65], and molecular complex detection (MCODE) [66]. Wu et al. developed a clustering algorithm to identify clusters and found that proteins in the same cluster are associated with the same diseases [67]. Rezaei-Tavirani et al. identified clusters using the ClusterONE algorithm to search for potential biomarkers of esophageal adenocarcinoma [68]. Xiao et al. also identified candidate protein biomarkers for endometriosis using their own clustering algorithm and found that the majority of the predicted biomarkers in the generated clusters are involved in the endometriosis pathway [69].
PPI networks can also be applied to understand the association between diseases, since comorbidity, a condition in which a patient is simultaneously affected by more than one disease, occurs clinically. A disease association network based on PPI analysis can be used as a framework to classify diseases, identify the risk of developing other diseases, predict the effects of a disease, and search for more effective therapeutic techniques [70,71]. There are a few hypotheses for constructing a disease association network; one is that diseases are associated when they share proteins and interactions (Figure 2).
The components of a disease association network are similar to those of a PPI network. It consists of nodes and edges, where a node represents a disease and an edge represents an association between diseases. The first human disease network was constructed among 867 diseases using PPI information by Goh et al. (Figure 3) [73]. This network has been used to understand how diseases are comorbid with each other by identifying the proteins and interactions they share [72]. The disease association network is also useful for predicting disease biomarkers. Ahmed et al. identified 73 potential biomarkers for neurological diseases, namely Alzheimer's disease, epilepsy, and dyslexia, by integrating protein-disease associations with PPI information [74].
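The two ways of associating diseases described here, through shared proteins and through interactions between disease-related proteins, can be sketched as follows. This is only an illustrative sketch: the disease names, the proteins `P1` to `P5`, and the function names are hypothetical, and the cited studies may apply additional filtering or statistics that are not shown.

```python
from itertools import combinations


def shared_protein_links(disease_proteins):
    """Approach 1: link two diseases when they share at least one protein."""
    links = {}
    for d1, d2 in combinations(sorted(disease_proteins), 2):
        shared = disease_proteins[d1] & disease_proteins[d2]
        if shared:
            links[(d1, d2)] = shared
    return links


def shared_interaction_links(disease_proteins, ppi_edges):
    """Approach 2: link two diseases when a protein of one disease
    physically interacts with a protein of the other."""
    links = {}
    for d1, d2 in combinations(sorted(disease_proteins), 2):
        p1, p2 = disease_proteins[d1], disease_proteins[d2]
        bridging = {
            (a, b) for a, b in ppi_edges
            if (a in p1 and b in p2) or (a in p2 and b in p1)
        }
        if bridging:
            links[(d1, d2)] = bridging
    return links


# Hypothetical toy data: three diseases, their proteins, and two PPI edges.
disease_proteins = {
    "disease_A": {"P1", "P2"},
    "disease_B": {"P2", "P3"},
    "disease_C": {"P4"},
}
ppi_edges = [("P1", "P4"), ("P3", "P5")]

print(shared_protein_links(disease_proteins))
print(shared_interaction_links(disease_proteins, ppi_edges))
```

Here disease_A and disease_B are linked by the shared protein P2, while disease_A and disease_C are linked only through the P1-P4 interaction, which shows why the two approaches can produce different disease networks from the same data.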

To summarize, PPI analysis is a powerful approach that can improve the understanding of the pathobiology of diseases, which in turn can inform approaches to diagnose, prevent, and treat them. The analysis of network properties provides the opportunity to interpret the normal and altered biological behaviors that lead to diseases.

Figure 2.
Two different approaches to identify the association between diseases. The first approach uses shared nodes (red nodes) between diseases. The second approach uses shared interactions (blue edges) between disease-related proteins [72].

Figure 4.
Example of a biological pathway. This is the insulin signaling pathway that was retrieved from the KEGG database [80].

Pathway analysis
A pathway is a group of molecules that interact to perform the same biological function. A PPI network has a single type of node, the protein, and is undirected. In contrast, a pathway (Figure 4) consists of several types of nodes, namely signaling genes, proteins, complexes, and metabolites, which are connected by several types of interactions such as activation, inhibition, and binding. A pathway depicts the mechanism by which a specific biological activity is performed in a cell. As with PPI networks, a combination of several pathways forms a pathway network. Pathway information can be retrieved from several pathway databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) [75], Reactome [76], WikiPathways [77], BioCyc [78], and BioCarta [79]. Table 2 shows databases that hold pathway information for human.
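To make the contrast with an undirected PPI network concrete, the sketch below stores a pathway as a list of directed, typed edges and follows them downstream from a starting node. The edge list is a hand-written, heavily simplified chain loosely based on insulin signaling; it was not retrieved from KEGG, and treating binding as a positive effect is an assumption made only for this illustration.

```python
from collections import deque

# Each edge is (source, target, interaction_type); direction matters.
# Hand-written simplification for illustration, not a database export.
pathway_edges = [
    ("INS", "INSR", "binding"),
    ("INSR", "IRS1", "activation"),
    ("IRS1", "PI3K", "activation"),
    ("PI3K", "AKT", "activation"),
    ("AKT", "GSK3B", "inhibition"),
]

# Assumption: binding and activation pass the signal on (+1), inhibition flips it (-1).
SIGN = {"binding": 1, "activation": 1, "inhibition": -1}


def downstream_effects(edges, start):
    """Breadth-first walk along edge direction.

    Returns {node: +1 or -1}, the net sign of the first path found from
    `start` to each reachable node. With branching pathways, different
    paths may disagree; this sketch keeps only the first one.
    """
    outgoing = {}
    for source, target, kind in edges:
        outgoing.setdefault(source, []).append((target, SIGN[kind]))

    effects = {}
    queue = deque([(start, 1)])
    while queue:
        node, sign = queue.popleft()
        for target, edge_sign in outgoing.get(node, []):
            if target != start and target not in effects:
                effects[target] = sign * edge_sign
                queue.append((target, effects[target]))
    return effects


print(downstream_effects(pathway_edges, "INS"))
print(downstream_effects(pathway_edges, "GSK3B"))  # nothing lies downstream
```

Starting from INS, every node in the chain is reachable and GSK3B carries a negative net sign, whereas starting from GSK3B returns an empty result. An undirected PPI representation of the same molecules could not express either the direction or the sign.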
There are three main types of pathways: signaling, regulatory, and metabolic. A signaling pathway depicts the cellular response to an extracellular signal. Signal transmission starts when an extracellular signal activates a receptor located on the cell surface. The receptor binds the signal and alters intracellular molecules to produce a response [84]. Any disruption of a signaling pathway can cause disease, since the cells cannot respond properly when signals are received [85]. A regulatory pathway depicts gene or protein expression in a cell, whether upregulated or downregulated. Biological activities such as transcription, translation, and post-translational modification are among those that involve regulatory pathways [86]. Meanwhile, in a metabolic pathway, a starting metabolite is converted into another metabolite through a series of chemical reactions catalyzed by enzymes [87].
Pathway databases such as KEGG also provide pathways that depict the mechanisms of several complex diseases, such as cancer, diabetes mellitus, Alzheimer's disease, and Parkinson's disease [75]. In general, a complex disease involves several pathways spanning the signaling, regulatory, and metabolic types. Combining and integrating several pathways with other types of data, such as PPI, is a valuable approach for improving the understanding of complex disease mechanisms [88].
As an analogy, an electrician needs a diagram such as a circuit diagram to understand an electrical system. Likewise, a diagram such as a biological network helps researchers and clinicians understand the mechanisms of diseases. The biological network can also suggest novel means of developing molecular therapies in which the network, rather than individual molecules within it, is the target of therapy.

Modeling and simulation
Mathematical modeling and computer simulation are another form of computational systems analysis that has been used to study disease progression and drug development [89,90]. While a biological network is generally constructed in a static state, using annotated genes, proteins, and metabolites linked by information from PPI and pathway databases, models and simulations are constructed in a quasi-steady state and require additional data, including physicochemical and physiological balances and bounds (mass and energy conversion) [91]. Modeling and simulation have been widely used in chronic diseases such as diabetes, Alzheimer's disease, and coronary heart disease, and in infectious diseases such as meningitis and influenza [89,92,93,94,95]. Several types of models have been applied to understand human diseases: the pharmacokinetics (PK) model, the pharmacokinetics/pharmacodynamics (PKPD) model, the disease progression model, the metamodel, and Bayesian model averaging [90]. The PK model is widely used in clinical pharmacology, as it simulates the rate and extent of drug distribution to different tissues and the rate and impact of drug disposition. It is a very important model because it predicts the impact of variability in target patient populations on the response to drug administration [96]. The PKPD model is another model used in drug development; it integrates PK and pharmacodynamics (PD) components. This model establishes and measures dose-concentration-response relationships and describes and predicts the time course of effects resulting from a drug dose [97].
Meanwhile, the disease progression model is a quantitative descriptor of the time course of disease status. It was first simulated in 1992 in Alzheimer's disease, using the cognitive component of the Alzheimer's disease assessment scale (ADASC) to assess disease severity [93]. This model characterizes the natural progression of a disease by incorporating biomarkers of disease severity and/or clinical outcomes. The disease progression model is often integrated with PK and PKPD models to quantify the effects of drug treatment on disease progression [98]. A metamodel is developed by combining results from multiple previous studies. In human disease studies, this model can be used to compare the effects and safety of new treatments with those of other treatments, to reevaluate data when results are mixed or conflicting, and to describe PD or disease progression models [90,99]. Bayesian model averaging, in turn, combines models when previous studies report several models for a drug in a certain disease and it is unclear which model is suitable. Bayesian model averaging reduces this uncertainty by allowing all existing models to contribute to a simulation, weighting their inputs on the basis of criteria such as the quality of the data or model [90,92].
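As a rough illustration of the PK, PKPD, and model-averaging ideas described above, the sketch below uses two common textbook forms, a one-compartment intravenous bolus model and an Emax effect model, followed by a simple weighted average of model predictions. These specific model forms are assumptions chosen for brevity and are not taken from the cited studies. All parameter values are hypothetical, and the averaging weights are supplied by hand rather than estimated from data.

```python
import math


def pk_concentration(dose, volume, k_elim, t):
    """One-compartment IV bolus model: C(t) = (dose / volume) * exp(-k_elim * t)."""
    return (dose / volume) * math.exp(-k_elim * t)


def emax_effect(conc, e_max, ec50):
    """Emax model linking concentration to effect: E = e_max * C / (ec50 + C)."""
    return e_max * conc / (ec50 + conc)


def averaged_prediction(predictions, weights):
    """Weighted average of predictions from several candidate models.

    Weights are normalized here; in Bayesian model averaging they would be
    posterior model probabilities, which this sketch does not compute.
    """
    total = sum(weights)
    return sum(p * w for p, w in zip(predictions, weights)) / total


# Hypothetical parameters: dose in mg, volume in L, k_elim in 1/h, time in h.
dose, volume, k_elim = 100.0, 10.0, 0.1
half_life = math.log(2) / k_elim

for t in (0.0, half_life, 24.0):
    conc = pk_concentration(dose, volume, k_elim, t)
    effect = emax_effect(conc, e_max=80.0, ec50=5.0)
    print(f"t={t:5.2f} h  C={conc:6.3f} mg/L  E={effect:6.2f}")

# Two hypothetical candidate models predict different effects at the same time point.
print(averaged_prediction([40.0, 50.0], weights=[0.75, 0.25]))
```

Chaining `pk_concentration` into `emax_effect` gives the dose-concentration-response relationship described for the PKPD model. A disease progression model could be added in the same style as a function of time whose output is shifted by the drug effect.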
Complex diseases involve many genes, proteins, and metabolites, which are activated or deactivated in certain tissues at particular times, depending on disease status or on the influence of factors such as drug administration. Hence, modeling and simulation are efficient approaches in computational systems analysis, as they can dynamically monitor and explain the progress of diseases in particular situations, which in turn can help improve specific treatments and develop efficient drugs for complex human diseases.

Data repository
Data are the most important resource in computational systems analysis. Most analyses require the integration of several datasets to understand diseases, particularly complex diseases, from a systemic view. For example, several omics datasets (genomics, transcriptomics, proteomics, and/or metabolomics) are integrated with interaction data (PPI or pathway) to construct biological networks. Modeling and simulation also involve omics data integration to capture the complexity of the molecular events that cause diseases. In addition, cellular and physiological processes are complex systems [100] that are controlled by signals from the extracellular environment and coordinated by intracellular interactions and transcriptional or gene regulatory networks assembled into functional modules [101]. Understanding cellular processes as interconnected and interdependent systems, in the context of a biological phenomenon, requires an integrative approach that draws upon as many diverse data sources as possible, including the literature, public databases, biochemical and kinetic experiments, phenotype studies, and high-throughput analyses of the genome, transcriptome, proteome, interactome, and metabolome.
Hence, data repository or database development is one of the main approaches for facilitating arbitrary querying of data for computational systems analysis. In addition, recent developments in high-throughput approaches enable the analysis of the transcriptome, proteome, interactome, metabolome, and phenome on an unprecedented scale, contributing to a deluge of experimental data that are scattered and unorganized. Data repositories are one effort to combine the growing sets of experimental data in an organized way so that they can be publicly accessed for further analysis. For example, databases such as ArrayExpress [102], Gene Expression Omnibus (GEO) [103], and CIBEX [104] store datasets from gene expression studies for public access. There are also literature databases such as PubMed, Scopus Online, and Google Scholar from which researchers can retrieve published studies, and several studies provide their generated omics datasets as supplementary material.
There are also databases, such as disease databases, that perform several analyses before depositing data. Human disease databases have been developed to store information related to a particular disease, such as genes, proteins, metabolites, drugs, literature, biological processes, and tissues, in order to understand the pathobiology, pathogenesis, and pathophysiology of diseases. Currently, databases such as DisGeNET [105], MalaCards [106], Online Mendelian Inheritance in Man (OMIM) [51], Open Targets [107], GWAS Catalog [108], GWASdb [109], DISEASES [110], and the Human Gene Mutation Database (HGMD) [111] store a range of information about human diseases. There are also databases dedicated to a single disease, such as T2D-Db [112] and T2D@ZJU [113] for type-2 diabetes; AlzBase [114], AlzGene [115], and NIAGADS [116] for Alzheimer's disease; and The Cancer Genome Atlas (TCGA) [117] and the International Cancer Genome Consortium (ICGC) [118] for human cancers.
The number of databases holding a growing volume of generated data is also increasing, which has created a new challenge in selecting the most suitable database for further computational systems analysis. Nevertheless, the currently available data repositories and databases spare researchers an extensive search for data, making it easier to integrate the data and visualize them as a network and/or model in order to gain a comprehensive systems-level understanding of the pathophysiological processes of human diseases.

Computational systems analysis progress in PCOS
PCOS is a heterogeneous disorder that may be affected by multiple factors, including genetics, lifestyle, and environment. The definition of PCOS is unclear: it is defined by a combination of different features, so its diagnostic criteria remain controversial. Women with PCOS also experience multiple symptoms, and the diseases that are comorbid with PCOS vary widely [6,119]. The complexity of PCOS is evident in the many genes, proteins, and metabolites involved in its pathobiology. All omics platforms have been applied to identify the molecular basis of PCOS (Table 3) [120].
Even though all omics platforms have been applied to PCOS, its pathobiology is still far from understood. The prevalence of PCOS is increasing, and women with PCOS who are left untreated are at higher risk of developing other chronic diseases (endometrial cancer, type-2 diabetes, and cardiovascular diseases). Other approaches, such as computational systems analysis, are therefore needed to improve the understanding of PCOS. So far, several studies have integrated the omics platforms using computational systems analysis to provide a systems-level understanding of PCOS.

PPI and pathway analysis
In PCOS, PPI- and pathway-based analyses are also often used to identify the genes/proteins, ontologies, and pathways that might be involved in this disorder. One of the earliest full-paper studies to use PPI analysis in PCOS was published in 2009. In this work, Mohamed-Hussein and Harun combined seven microarray datasets, integrated them with PPI information, and identified a hypothetical protein, C1ORF123, and several ontologies that might be highly involved in PCOS [128]. Prior to this study, an article outline published in 2007 by Menke et al. used a Newman algorithm to identify a small set of modules in the constructed PCOS PPI network that could lead to PCOS phenotypes [129]. Shen et al. [130] constructed a regulatory network and a PPI network by integrating several data types, such as genome-wide methylated DNA immunoprecipitation (MeDIP), regulatory interactions, and PPI, to investigate the relationship between insulin resistance (IR) and PCOS. In the regulatory network, the significantly methylated gene CCAAT enhancer binding protein beta (CEBPB) formed a network that regulated other genes that may play a role in both IR and PCOS. Meanwhile, the constructed PPI network showed that the methylated genes in PCOS-IR have a higher number of interactions and might act as key drivers of proper cellular function. Shen et al. also found several enriched pathways, such as cancer pathways and MAPK signaling, and ontologies, including regulation of metabolic process, in both constructed networks that might be responsible for both PCOS and IR [130]. Shim et al. analyzed a genome-wide association study (GWAS) dataset of PCOS and identified several PCOS pathways associated with ovulation and insulin secretion [131]. Kori et al. [132] used PPI and pathway analysis by integrating three microarray datasets of PCOS with PPI data, performing pathway enrichment analysis, and comparing the PCOS results with those for ovarian cancer and endometriosis. These analyses found that PCOS is closely related to endometriosis and ovarian cancer, as they share several molecules and pathways such as MAPK signaling, cell cycle, and apoptosis [132]. The integration of a microarray dataset with PPI information from Reactome identified several proteins, including Rho GTPase activating protein 4 (ARHGAP4), Rho GTPase activating protein 9 (ARHGAP9), ras homolog family member G (RHOG), and LYN proto-oncogene, Src family tyrosine kinase (LYN), as well as pathways, such as RhoA-related pathways and the glycoprotein VI-mediated activation cascade, that might be involved in PCOS pathogenesis [133].
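Several of the studies above combine gene lists from expression datasets with PPI information. One simple way to do this, shown here only as an assumed workflow and not as the procedure of any cited study, is to keep genes that recur across datasets and then extract the PPI edges whose two endpoints are both in that gene set. The gene identifiers `G1` to `G5`, the recurrence threshold, and the function names are hypothetical.

```python
from collections import Counter


def recurrent_genes(gene_sets, min_datasets):
    """Genes reported in at least `min_datasets` of the input gene sets."""
    counts = Counter(gene for genes in gene_sets for gene in set(genes))
    return {gene for gene, count in counts.items() if count >= min_datasets}


def induced_subnetwork(ppi_edges, genes):
    """PPI edges whose two endpoints both belong to the gene set of interest."""
    genes = set(genes)
    return [(a, b) for a, b in ppi_edges if a in genes and b in genes]


# Hypothetical differentially expressed gene sets from three expression datasets.
datasets = [
    {"G1", "G2", "G3"},
    {"G2", "G3", "G4"},
    {"G3", "G5"},
]
# Hypothetical PPI edges taken from an interaction database.
ppi_edges = [("G2", "G3"), ("G3", "G4"), ("G1", "G5")]

core_genes = recurrent_genes(datasets, min_datasets=2)
print(sorted(core_genes))
print(induced_subnetwork(ppi_edges, core_genes))
```

The resulting subnetwork can then be passed to topological, clustering, or enrichment analyses. Stricter or looser recurrence thresholds trade sensitivity against the risk of including dataset-specific noise.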
Other than identifying the molecular basis and biological functions that might relate to PCOS, PPI and pathway analysis are also applied to decipher the molecular relationship of PCOS with other diseases and to improve knowledge of PCOS treatments. Liu et al. [134] constructed a PPI network consisting of PCOS-related genes and target genes of Erxian decoction (EXD) to understand the pharmacological basis of EXD action in treating PCOS. EXD is a traditional Chinese medicine composed of six types of herbs that can alleviate several problems, such as ovarian failure, that are commonly experienced by women with PCOS. In the constructed network, Liu et al. identified 50 genes that might be key genes in PCOS treatment with EXD, since these genes are EXD targets found to be related to PCOS [134]. Ramly et al. [135] also used PPI and pathway analysis to identify proteins and pathways that explain the relationships between PCOS and 17 diseases, such as migraine, ovarian cancer, and schizophrenia. They used a clustering approach with MCODE [66] to identify proteins shared between PCOS and other diseases, and pathway enrichment analysis to identify pathways that might connect PCOS and PCOS-associated diseases [135].
The aforementioned studies indicate that PPI- and pathway-based analyses can be used to identify genes/proteins, biomarkers, ontologies, and pathways related to PCOS, which in turn could improve the diagnosis and treatment of PCOS.
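Pathway enrichment analysis, mentioned in several of the studies above, is commonly carried out as an over-representation test based on the hypergeometric distribution. The sketch below shows that general idea only; the cited studies may have used other tools or statistics. The pathway names, gene identifiers, and function names are hypothetical, and a real analysis would also correct the p-values for multiple testing.

```python
from math import comb


def enrichment_pvalue(overlap, list_size, pathway_size, background_size):
    """P(X >= overlap) for a hypergeometric draw of `list_size` genes from a
    background of `background_size` genes, `pathway_size` of which are in the pathway."""
    upper = min(list_size, pathway_size)
    favourable = sum(
        comb(pathway_size, i) * comb(background_size - pathway_size, list_size - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(background_size, list_size)


def enrich(gene_list, pathways, background):
    """Test every pathway; return (name, overlapping genes, p-value) sorted by p-value."""
    background = set(background)
    genes = set(gene_list) & background
    results = []
    for name, members in pathways.items():
        members = set(members) & background
        overlap = genes & members
        p_value = enrichment_pvalue(
            len(overlap), len(genes), len(members), len(background)
        )
        results.append((name, sorted(overlap), p_value))
    return sorted(results, key=lambda row: row[2])


# Hypothetical background of ten genes and two hypothetical pathways.
background = [f"G{i}" for i in range(1, 11)]
pathways = {
    "pathway_A": {"G1", "G2", "G3", "G4", "G5"},
    "pathway_B": {"G6", "G7", "G8"},
}
results = enrich(["G1", "G2"], pathways, background)
for name, overlap, p_value in results:
    print(name, overlap, round(p_value, 4))
```

The choice of background matters: it should contain only genes that could have been detected in the experiment, otherwise the p-values are not meaningful.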

Data repository in PCOS
As mentioned, many datasets have been generated by omics platforms to investigate the pathobiology of PCOS. These datasets are scattered across sources, and retrieving information about PCOS from them is tedious for researchers. Hence, it is essential to have a repository that stores comprehensive information on PCOS.
Three databases have been developed so far to deposit the collated molecular information generated by previous studies: PCOSBase (www.pcosbase.org) [136], PCOSKB (http://pcoskb.bicnirrh.res.in/) [137], and PCOSDB (http://pcosdb.net/) [138]. PCOSKB and PCOSDB contain 241 and 208 PCOS-related genes, respectively. These databases identified PCOS-related genes by searching the scientific literature. Meanwhile, PCOSBase identified 8185 PCOS-related proteins obtained from existing disease databases and from gene and protein expression studies. All of the PCOS databases provide a detailed description of each PCOS-related entry and link to the original databases, such as UniProt (https://www.uniprot.org/) and NCBI (https://www.ncbi.nlm.nih.gov/), for more extensive information. In PCOSBase, biological information such as chromosomal location, gene ontologies, pathways, domains, associated diseases, and tissue localization has been annotated for all PCOS-related proteins. Figure 5 shows an example of the PCOSBase homepage, which provides a search box for keyword searches and shows the number of entries for each functional detail deposited in the database.
All of these databases were developed to support other researchers in identifying PCOS biomarkers. In addition, information from the databases has been integrated with other information, such as PPI and pathway data, to obtain a systems-level view of PCOS. PCOSBase provides a menu ("Network") that contains biological networks of PCOS as examples of analyses of the PCOS-related proteins in this database. The networks provided in the database can offer insights that improve knowledge of PCOS in particular.
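The value of collating scattered findings into a queryable repository can be illustrated with a small in-memory SQLite table. The table layout and column names below are hypothetical and do not reflect the actual schema of PCOSBase, PCOSKB, or PCOSDB. The rows only restate gene-study associations already mentioned in this chapter, with the chapter's citation numbers as the source field.

```python
import sqlite3

# In-memory database; the schema is a hypothetical illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE pcos_gene (
        symbol   TEXT PRIMARY KEY,
        evidence TEXT NOT NULL,   -- type of data integration reported
        citation TEXT NOT NULL    -- reference number used in this chapter
    )
    """
)

rows = [
    ("C1ORF123", "microarray + PPI", "[128]"),
    ("CEBPB", "MeDIP + regulatory network", "[130]"),
    ("ARHGAP4", "microarray + PPI", "[133]"),
    ("ARHGAP9", "microarray + PPI", "[133]"),
    ("RHOG", "microarray + PPI", "[133]"),
    ("LYN", "microarray + PPI", "[133]"),
]
conn.executemany("INSERT INTO pcos_gene VALUES (?, ?, ?)", rows)


def genes_by_evidence(connection, evidence):
    """Return gene symbols supported by a given evidence type, sorted by symbol."""
    cursor = connection.execute(
        "SELECT symbol FROM pcos_gene WHERE evidence = ? ORDER BY symbol",
        (evidence,),
    )
    return [row[0] for row in cursor]


def genes_by_citation(connection, citation):
    """Return gene symbols reported by one cited study, sorted by symbol."""
    cursor = connection.execute(
        "SELECT symbol FROM pcos_gene WHERE citation = ? ORDER BY symbol",
        (citation,),
    )
    return [row[0] for row in cursor]


print(genes_by_evidence(conn, "microarray + PPI"))
print(genes_by_citation(conn, "[130]"))
```

A production repository would add further annotation tables (for example pathways or tissue localization) and link each entry to external identifiers, but the same pattern of parameterized queries applies.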

Conclusion and future perspective
PCOS is an endocrine disorder that is linked to many clinical symptoms and a diversity of diseases. The complexity of PCOS requires the development of novel analysis methods, such as the simultaneous analysis of omics data using computational systems analysis. In addition, the availability of multi-omics datasets has opened an avenue to new insights into the molecular pathophysiological changes in PCOS. Thus, the previously generated data should be fully utilized as a whole to obtain a systems view of PCOS. As mentioned in this chapter, specific data repositories for PCOS have also been developed, which could be used for further analysis by PCOS researchers. However, there is a lack of studies that integrate omics datasets using modeling and simulation to investigate PCOS at a systems level. This approach should be considered in the future, as it can dynamically elucidate PCOS progression and improve PCOS diagnosis and treatment. Although there are limitations, particularly the incompleteness of biological information such as the human interactome and pathway annotation, the analysis of current data by computational systems analysis should continue, as these efforts could steadily enhance knowledge of PCOS as a complex syndrome.