Bayesian Modeling Approaches for Temporal Dynamics in RNA-seq Data Bayesian Modeling Approaches for Temporal Dynamics in RNA-seq Data

Analysis of differential expression has been a central role to address the variety of bio - logical questions in the manner to characterize abnormal patterns of cellular and molecu lar functions for last decades. To date, identification of differentially expressed genes and isoforms has been more intensively focused on temporal dynamics over a series of time points. Bayesian strategies have been successfully employed to uncover the com - plexity of biological interest with the methodological and analytical perspectives for the various platforms of high-throughput data, for instance, methods in differential expres - sion analysis and network modules in transcriptome data, peak-callers in ChipSeq data, target prediction in microRNA data and meta-methods between different platforms. In this chapter, we will discuss how our methodological works based on Bayesian models address important questions to arise in the architecture of temporal dynamics in RNA- seq data.


Introduction
The differential expression analysis across external conditions (e.g. drug treatments, or between-or within-cell/tissue types) in stimuli-response data has long been crucial part on clinical applications [1][2][3][4][5][6][7][8][9][10][11]. The primary goal of these studies is to target therapeutic effects on genes and their pathways that are highly associated with the alterations between different conditions, corresponding underlying biological mechanisms, and condition-specific molecular processes from microarray until recent RNA-seq platform [1-3, 5, 6, 9-43]. Such substantial effects on transcriptome data involved by various types of human diseases have significantly addressed fundamental issues characterized by biological phenomena in transcriptional regulation. For instance, examination of classification of subtypes on hereditary breast and ovary cancer [44][45][46], reciprocal phylogenetic conservation and heterogeneity between closest animal models of aging and depression in brain tissues [47][48][49], identification of differential expression on enzyme effect in Gaucher's disease across distinct three tissues [14,50], developmental transient patterns in mouse embryonic stem cells in pre-frontal cortex and limb tissues [13,51], and et cetera.
The characterization based on Bayesian strategies for whole-genome wide transcriptome data and other types of Seq data has been successfully addressed on the variety of questions to arise in biomedical community [8,9,52]. As a naïve Bayesian method, baySeq framework has been proposed in differential expression analysis between different groups with replicates [52]. ShrinkBayes has been developed to identify differential expression analysis at static data on the basis of zero-inflated Poisson Gamma model and Integrated Nested Laplace Approximation to estimate shrunken parameters [53]. And another Bayesian technique in ChipSeq, BayesPeak has been developed to detect significantly enriched regions in transcription factor binding sites and histone modification datasets. It is on the basis of Bayesian hidden Markov model strategy and MCMC simulations with Poisson Gamma distribution to take into account over-dispersion in the abundance of read counts in different regions by comparing to the existing methods of peak callers in ChipSeq data, MACS, PeakSeq, and ChipSeq Peak Finder [54]. Additionally, Bayesian approaches to define differential expression of alternative splicing in RNA-seq have been proposed by MISO (as known as mixture of isoforms) model and MATS (as known as multivariate analysis of transcript splicing) [9,18,55]. MISO method is to estimate differential expression of alternatively spliced exons and isoforms based on Bayes factor to quantify the odds of differential regulation of given isoform for the ratio of inclusion and exclusion levels. Similarly, MATS (and rMATS) is implemented to test differential alternative splicing patterns by estimating exon inclusion levels between two samples without (and with replicates), respectively. In addition, microRNAs target prediction methods that play a key regulatory role in gene regulation on the variety of biological processes in human diseases, especially, cancer development, have been proposed by Bayesian methods [56,57]. And also, Bayesian network methods have been proposed to predict the important functions of long non-coding RNAs as well as coding genes in RNA-seq [58,59]. Thus, RNA-seq has become the alternative in transcriptome studies with advantageous features over arrays, dynamic range of expression signals, higher reproducibility and quality on samples and upgraded annotation without any known priori [3,16,19,20,24,[26][27][28]32].
With the strength on improvements on technology and continuously declining cost to sequencing, RNA-seq enables to facilitate to perform more dense experimental designs such as time course data with abundant resources of dynamic gene regulation. Furthermore, transcriptome data and other types of meta-framed data across different platforms will be more popularly investigated in years to come. Indeed, next generation sequencing technologies have been steadily improved with higher throughput, longer reads, deeper sequencings, larger samples size of replicates, and less biases on data. Such advances allow investigators to conduct more complex experimental studies, various types of time course experiments, such as single series of longitudinally measured time course and transient dynamic patterns in developmental stages; multi-series of factorial time course with multiple external conditions at each time point; cell-cyclic periodical data with (or without) external stimuli [7,8,22,23,29,30,60].
To date, despite the substantial potential and significance to explore temporal dynamics at genes and other genomic features in various human disease progressive models and therapeutic effects, the lack of analytical methodologies in order to precisely characterize temporal dynamics has been an important challenging issue to better understand biological mechanisms relevant with both time specific and responsively altered changes by given external stimulus.
In this chapter, we propose Bayesian approaches to better infer differential expression in temporal and spatial dynamic regulation that can be widely adopted in the community of biomedical research such as pediatric disease progressive models, age-related neurodegenerative diseases, other types of longitudinal and multi-series of time course data.

Data types of time course experiments
Dynamic gene regulation within a time window in transcriptome data is generally subcategorized into (1) within-subject longitudinally repeatedly measured stimuli-response data in a single series of time course experiment, (2) between-subject factorial multi-series of time course data with different conditions at each time point, and (3) periodical data in cell-cycle or circadian rhythmic patterns with or without external conditions. For the first type of temporal dynamics, we initially proposed Bayesian Poisson Gamma (negative binomial model) strategy to identify temporally differentially expressed genes in the previous study [23]. In this model, each gene is statistically tested whether it is equal or differential expression by auto-regressive (AR) model. The detailed description for notations is given in the following, Model I: supposedly, a gene expression profile across a series of multiple time points, y grt is independently distributed as Poisson Gamma (negative binomial model to account for variability of biological replicates within a group), g = 1, 2, … , G (gene or other genomic feature), t = 1, 2, … , T (time point), and r = 1, 2, … , R (biological replicate at each time point), In this proposed model, β g is assumed to have non-informative priors and time series random effects model for sequentially measured single series of time course data is assumed. To update the defined auto-regressive model and estimate the posterior probabilities of given parameter sets, we employ Markov Chain Monte Carlo simulations, N = 10,000 iterations and 8000 burn-ins. To define temporal dynamics whether or not a given gene is temporally differentially expressed, we further examine the most interesting parameter of auto-coefficient representing time series sequential random effects in this proposed model. To classify between equal and differential expression, we implemented to compute Bayesian credible interval estimates. This proposed model is implemented by OpenBUGS (WinBUGS) in R (submitted paper). Basically, it allows us to include major factor of time and variability of replicates at each time point and this simple linear auto-regressive (AR) model is straightforwardly extended to identify temporally differentially expressed other genomic features, for instance, quantified abundance of transcripts. Despite the much better improved quality of samples in RNA-seq data when compared to the beginning of technology, the preprocessing and normalization procedures are still required to precisely infer temporally differentially expressed genes and isoforms and to reduce the misleading results in subsequent analyses. And some samples are discarded in the preprocessing step such as due to sample preparation in experiments and the rest of corresponding samples for the given missing sample should be also deleted as it is the method for longitudinal data with repeated measurements across the series of time points. More importantly, RNA-seq data is highly skewed expression data toward to zero and low expression levels than high expression level. Collectively, in the format of longitudinally measured experimental designs in temporal dynamics of RNA-seq data, as further improved strategy, we are currently developing a zero-inflated Poisson Gamma model with missing observations in longitudinal data to improve the capability of detection of temporal changes in highly skewed count data [61,62]. The detailed descriptions and notations are given in the following, we consider the conditional distribution of y grt | E grt , where E grt is the binary indicator of whether a gene expression profile across a series of multiple time points is present for a gene g, time t and replication r. We also assume that conditioned on p grt , the E grt 's are independent Bernoulli random variables with P ( E grt = 1 ) = p grt for g = 1, 2, … , G , t = 1, 2, … , T , and r = 1, 2, … , R . Here, given E grt = 1 , assume that the y grt 's are conditionally independent.
Compared to arrays, the major strength of RNA-seq transcriptome data enables to quantify and identify spliced isoforms as well as individual exon-level expression which had not been previously done due to low resolution [8-11, 13, 14, 25, 28, 30, 33, 35, 60, 63-68]. In addition to gene level analyses, it is well established that alternative splicing is a prevalent mechanism on the variety of organisms. It involves multiple selective schemes of splice sites to construct diverse functional pathways and protein structures in gene regulation. A single mRNA may code for different forms of a protein (isoform) as a result of alternative splicing that increases the complexity of mammalian transcriptome.
In the previous literatures, deep sequencing based transcriptome data predicts that more than ~95% of human genes typically contain multi-exons and transcript-variants to undergo alternative spliced events [8,9,25,28,33,35,[63][64][65]67]. The aberrant alternative splicing events occurring in post-transcriptional and translational procedures are highly associated with different tissues-or developmental stages-or environmental condition-specific manner. It has been investigated that malformation and dysfunctional mechanisms by the majority of abnormal alternative splicings in human brain diseases. Aberrant patterns of splicings in neurodegenerative Alzheimer's patients and other types of pediatric cancer progression could be significant contributors to targeted therapies on disease progressive models and developmental evolutionary processes in transcriptional activity in temporal dynamics [28,[63][64][65][66]. Despite the importance of alternative splicing events in recent technology, characterization of dynamic processes has been merely limited to gene level and static data approaches.
In the methodological point of view, for quantification and identification of isoforms, a couple of bioinformatics tools including IQSeq, rSeq, MapSplice have been recently developed [15,17,31], and identification of differentially expressed isoforms at static data types have been pursued by MATS (focused on a specified experimental design on a sample versus another single sample comparison at a fixed time point) [9], DEXSeq (flexibly to allow various types of experimental and biological conditions in generalized linear model from the basis of multiple comparisons at exon levels) [5], and cufflinks and cuffdiff (as known as on the of most popular versatile tools for quantification and identification of differential expression at isoform levels, but restricted to simple pairwise comparison with replicates) [10,11,30,68].
To our best knowledge, none of current static and dynamic methods can identify temporally differentially expression at alternative splicing by explicitly accounting for data-driven nature of various time course experimental settings.
For the first type of longitudinal time course experiments, quantified expression levels at isoforms and other types of genomic features can be directly applied for our proposed dynamic AR model.
For the second type of between-subject factorial multi-series of time course data, another our previous study [8] proposed a hierarchical Bayesian modeling approach to define differential expression analysis at isoforms when having multiple conditions at each time point, such as different tissues, drug treatments, stress, and trauma in temporal dynamics (see y grtc | γ g = d~POI ( μ grtc ) , l represents distinct patterns between conditions across time points, Namely, l g = l iff gth gene belongs to cluster l, Let L denote the number of clusters on expression data, where DP(H; α ) stands for the Dirichlet process having its baseline distribution H( • ) and mass parameter α. And, F 0 represents the mixture format of β with p( β ) = N for equal expression (EEX) and G N (δ, σ 2 ) for differential expression (DEX) and η 0 , ψ 0 , and M are fixed hyper-parameters based on prior information [6,69]. Therefore, the posterior probability is given by, P( γ g = d | l g = l, y g111 , … , y g,r=R,t=T,c=C ) for the temporally differentially expressed gene.
Based on our proposed Bayesian approach for multi-series of factorial time course data, we are currently implementing OpenBUGS(WinBUGS) in R to perform differential expression analysis. In order to validate our proposed model, we need to compare to other maSigPro for RNA-seq data and Gaussian Process modeling approach in terms of temporally differentially expressed genes in the multiple datasets after transformation of stabilizing variance on counts data.

Closing remarks in future directions
In earlier sections, we have discussed Bayesian techniques to address different types of experimental (clinical) settings in temporal dynamics by focusing on Poisson Gamma autoregressive model for longitudinally measured single-series of time course data and hierarchical Dirichlet Bayesian mixture model framework for multi-series of factorial time course data, respectively. Thus, as the continuous efforts to modeling approaches, we propose differential expression analytical frameworks that precisely characterize temporal dynamics for each type of stimuli-response data in this chapter.
The novel features in the proposed hierarchical Dirichlet Bayesian mixture model enables to identify significant temporal changes of expression levels between at least two external conditional factors over a series of time points based on Bayesian strategy grouping clusters by the patterns of differential or equal expression [6,69]. The identified temporal changes are determined as the putative biomarkers that could be relevantly linked with various dynamic genetic mechanisms of molecular and physiological processes. Additionally, our proposed method enables to allow more than two genetic and environmental factors within a time point and to address how the intra-factor of multiple conditions and time factor affect altered expression patterns as significant contributors independently and interactively. Furthermore, this proposed model is straightforwardly extended to detect temporal changes at other genomic levels such as transcripts and exon levels.
As the extension of this study, we are currently developing how to measure the relationship of a parent gene-to-multiple child isoforms in temporal dynamic patterns. For the task of this procedure, we carry out directional comparison, gene-to-isoform in differential expression based on similarity and discrepancy on magnitude and pattern of expression. This proposed model enables to define connectivity visualization of splicing maps on the variety of structural formations by switchable exon usage. Moreover, we are currently developing differential expression method for cell-cyclic periodical data with or without external conditions in stimuli-response data [70]. Thus, this proposed study is timely crucial to define temporal dynamics at alternative splicing diversity related with disease progression by discovering which splicing events are condition and time specifically observed and how eventually their spliced abnormal patterns and splicing maps are associated with biological functions. And it is essential to develop strategies to correct aberrant splicing as well as gene approaches in temporal and spatial dynamics on the variety of disease progression and evolutionary comparative studies between human diseases and other closely related species.

Conflict of interest
The authors have no conflicts of interest to disclose.

Contributors' statements
SO wrote manuscript and SO and SS conceived this study.
For discloser of any prior publications or submission with any overlapping information including studies and patients, there are no prior publications or submissions with any overlapping information including studies and patients.
The manuscript has not been and will not be submitted to any other journal while it is under consideration by this book chapter in Bayesian inference.
All authors approved the final manuscript as submitted and agree to be accountable for all aspects of the work.

Funding source
This study is supported by an internal grant from Jeju National University to Dr. Sunghee Oh.

Financial disclosure
The authors have no financial relationships relevant to this article to disclose.