Models of RNA Interaction from Experimental Datasets: Framework of Resilience

Resilience is a network property of systems responding under stress, which for biomed- icine correlates to chronic or acute insults. Current need exists for models and algo-rithms to study whole transcriptome differences between tissues and disease states to understand resilience. Goal of this effort is to interpret cellular transcription in a dynamic system biology framework of RNA molecules forming an information struc- ture with regulatory properties acting on individual transcripts. We develop and evalu-ate a bioinformatics framework based on information theory that utilizes RNA expression data to create a whole transcriptome model of interaction that could lead to the discovery of new biological control mechanisms. This addresses a fundamental question as to why transcription yields such a small fraction of protein products. We focus on a transformative concept that individual transcripts collectively form an “ infor- mation cloud ” of sequence words, which for some genes may have significant regulatory impact. Extending the concept of cis - and trans -regulation, we propose to search for RNAs that are modulated by interactions with the transcriptome cloud and calling such examples nebula regulation. This framework has implications as a paradigm change for RNA regulation and provides a deeper understanding of nucleotide sequence structure and -omic language meaning.


Introduction
The concept of resilience is receiving increasing attention in chronic stress-related disease conditions. Resilience has been shown in clinical studies to play a protective role in patients with chronic disease conditions including osteoarthritis, breast and ovarian cancer, diabetes, and cardiovascular disease. The purpose of this study is to explore the relationships between RNA-RNA interactions and to devise a related measure of resilience from network properties of the whole transcriptome.

RNA physiology
At various levels, RNA is processed by alternate mechanisms [1], suggesting a biological framework that supports important system network features such as resilience. Trafficking of RNAs is essential for cellular function and homeostasis, but only recently it has become possible to visualize molecular events in vivo. Analysis of RNA motion within the cell nucleus has been particularly intriguing as they have revealed an unanticipated degree of dynamics within the organelle [2]. Single-molecule RNA imaging methods have revealed that the intranuclear and cytoplasmic trafficking occurs largely by energy-independent mechanisms and is driven by diffusion. RNA molecules undergo constrained diffusion, largely limited by the spatial constraint imposed by chromatin and chromatin-binding proteins if in the nucleus as demonstrated in numerous studies. In the cell, transcripts move by a stop-and-go mechanism, where free diffusion is interrupted by random association with cellular structures [3]. The ability and mode of motion of RNAs has implications for how they find nuclear targets on chromatin or cellular sub-compartments and how macromolecular complexes are assembled in vivo. Most importantly, the dynamic nature of RNAs is emerging as a means to control physiological cellular responses and pathways [4]. For example, unexpectedly complicated nuclear egress and nuclear import of small RNAs is more common than previously appreciated [5].
Much attention has been focused on noncoding RNAs and their physiological/pathological implications [6]. This focus in RNA research is ultimately directed toward understanding the regulation of protein-coding gene networks, but ncRNAs also form well-orchestrated regulatory interaction networks [7]. For example, computational prediction of miRNA target sites suggests a widespread network of miRNA-lncRNA interaction [8]. Others suggest the possibility of widespread interaction networks involving competitive endogenous RNAs (ceRNAs) where ncRNAs could modulate regulatory RNA by binding and titration of binding sites on protein coding messengers [9]. Cellular uptake and trafficking of RNA could be widespread [10]. As the number of experiments increases rapidly, and transcriptional units are better annotated, databases indexing RNA properties and function will become essential tools to understand physiologic processes in the transcriptome.

Biological-omic information theory
Much of bioinformaticians sequence analyses focuses on methodologies based on string alignment algorithms. However, such approaches fail to discover genomic aspects of systemic nature regarding dynamics or resilience. An alternative framework is based on alignment-free methods of genome analysis, where global properties of genomes are investigated [11]. A key concept of informational analysis is that of probability distributions. A genomic, or in our case transcriptomic, distribution associates to discrete values defined on transcripts, the number of times these values occur in a given transcriptome. The general concept of discrete probability distribution, called information source, was the starting point of information theory developed by Shannon [12]. Links between information theory and biology emerged from Shannon's Ph. D. thesis, titled "An Algebra for Theoretical Genetics" (1940), where the notion of information entropy was introduced [13]. For example, distributions of codons have shown characteristic properties that are linked to biological meanings, such as secondary structure free energy [14]. Other approaches based on the recurrence of genomic elements and on correlation structures in DNA sequences use mutual information, which plays a central role in the mathematical analysis of message transmission. Dictionary-based methodologies analyze sequences through properties of collections of words. Dictionaries are concepts from formal language theory, probability, and information theory that provide new perspective which may uncover the physiology of internal transcriptome structures.

Methods
We formally define the transcriptome as an information structure, and then construct several simple models as examples. The most realistic model is used to examine real datasets of partitioned RNAs for validation of framework.

Transcriptome information theory structure
RNA sequence is abstractly represented as a string over the nucleotide alphabet R ¼ {A, C, G, U}. This can be extended to modified nucleotides with an extended alphabet R ffi {A, C, G, U, N}, such that symbol N represents a modified nucleotide. W k denotes a set of alphabet letters of length k, called k-mers and W denotes the set of all possible nonempty strings over the alphabet R. Given a transcript string S ¼ s1, s2, … to sn, of length n, S[i, j] with 1 ≤ i ≤ j ≤ n is the substring of S from position i to position j (included). The length of S is |S| ¼ n. Substrings of S of length k are called k-words or simply words of S. In the following, the entire transcriptome is denoted by W based on k-mer dictionaries and entropies, which are aimed at defining and computing informational indexes for representative sets of transcriptomes. We assume that the complexity of a transcriptome increases with its distance from randomness, as identified by suitable comparison between transcriptomes of the same length. This framework provides clues about the appropriate k length to consider for analysis of transcriptome properties.

Spatial transcriptome information cloud (STIC) model construction
We hypothesized that miRNA localization in cellular compartments is an emergent property from Brownian motion interactions of a cloud of RNA sequences and RNA-binding proteins that can be analyzed in W [15]. There k-mer words of miRNA functional size were added to a dictionary from sliding windows of transcript sequences S. A prediction from this cloud model is that anomalous diffusion can occur if random-walk transcripts interact with their surrounding scaffold as a stochastic semantic cloud, and if the cloud relaxation time is a longer time frame than transit [16]. We showed that RNAs with sequences similar to the whole transcriptome exhibit modified or enhanced transport compared to RNA sequences without similar sequences [17]. Thus, RNAs were found to partition into different cellular compartments based on a semantic similarity of word compositions within W k . We determine the frequency of all k-mers in the transcriptome as a matrix composed of RNA sequences and their word copy levels. For each transcript, we count the number of k-mer words in common within W k or dictionary as a semantic similarity measure to the transcriptome, and we can also able to compare such counts to randomized W k sequence words.
Model assumptions are: (1) RNA diffuses away from point of transcription creating a cloud of k-mer sequences. (2) All RNAs comprise the transcriptome, and each transcript is affected by local RNAs with effective interaction windows of some sequence word length k nucleotides (nt). We assume significant k-mer word size to be 3-22 nt, which is equal to the functional miRNA size at the high side, and down to below the size of the "core" sequence [18]. (3) The diffusion rate of individual RNAs depends on degree of (a) sequence similarity and (b) reverse complementarity of RNA words at that location in the STIC ( Figure 1). (4) Cloud dictionaries (collection of transcriptome word sets in W k ) change as function of distance from transcriptome site and cell state. (5) The cloud affects anomalous RNA diffusion that can give rise to an emergent and patterned behavior in the cell [19].
We model the RNA sequence word content of the transcriptome cloud as a function of distance from transcription site at the chromosome. RNA molecule diffusion in nuclear compartments would lead to cytoplasmic and extracellular localization of RNA if the transcript half-life is greater than its transit time. Calculations at arbitrary transit distances could be determined Figure 1. K-mer words classified as solvent-accessible or inaccessible. Note that in this framework, k-mer words are generated from a sliding window and not from a contiguous word segments as could be interpreted by the figures adjacent to word blocks. Transcriptome is a dictionary of N þ n words and their associated frequencies. Interaction of transcript with transcriptome affects diffusion from solvent-accessible bases (words). Transcript itself is a part of the transcriptome. from this model with a large set of partial differential equations modeling RNA mobility, but was described as computationally prohibitive [15] for any realistic sized transcriptome. Each sequence S would be dynamically modeled with a neighborhood sized number of RNA-RNA interactions. Instead, we pursue a thermodynamic approach based on the Fokker-Planck equation to quantify stochastic processes in liquid medium [20]. RNA interactions are assumed to function over sequence lengths encompassing small single-stranded and solvent-accessible regions.
Assuming smallest word in the cloud is 3 nt long, corresponding to the lower limit of size for a seed sequence in miRNA [18]. The upper limit for word size is set at 22 nt, corresponding to the size of a typical mature miRNA. Again, this is the same as the miRNA response elements (MRE) size in the simpler related ceRNA hypothesis by Salmena [9]. Instead, we determine the frequency of all words in the transcriptome as a matrix composed of RNA sequences and copy levels from RNA-seq datasets. For each transcript, count number of words in common with the cloud dictionary as a similarity measure to the transcriptome (tCount), and also count reverse complement words (rcCount) for RNA-RNA interactions. These raw counts can be multiplied by the frequencies of repetitive words in W to yield tWord ¼ tCount * tFreq and rcWord ¼ rcCount * rcFreq. As shown by Seffens [17], miRNAs with greater similarity to the transcriptome, i.e., greater tCount and tWord, are suggested to diffuse differentially based on spatial partitioning. In addition, greater intramolecular RNA-RNA interaction would be expected to hinder diffusion. This work proposes a general RNA sequence function that combines the influence of similarity with native (NAM) and reverse complementarity (RCM) measures as a cloud interaction function: C[W, NAM, RCM], such that cloud interactions increase with RCM, and decrease with NAM. Transcripts with low C would have "ideal solution" diffusion coefficients and found in cytoplasmic compartment, and those with greater C would be slowed by RNA-RNA interactions and hence enriched in nuclear or perinuclear compartments.

Accessibility of an RNA sequence word
For each component word of a transcript, determine whether it is expected to be in a singlestranded and solvent-accessible state (state "A"), or double-stranded or buried within the RNA molecule and is inaccessible (state "I"). For model calculations and preliminary studies (Model-1 W 1 discussed later), we assume all words are accessible in state "A," and the transcriptome is uniform within the cell (i.e., ignore distance r from transcription site). Construct a matrix W k (T, f A , f I , r) for each word size k, and populate the respective matrix with the component words of the transcriptome from RNA-seq reads such that S is the actual word sequence, f A is the frequency or number of accessible words of that sequence, and f I is the frequency or number of inaccessible words. Matrix W k then contains information of all transcript sub-sequences and is a representation of the spatial transcriptome information cloud in some volume elements of the cell fraction. Let the diffusion coefficient for a transcript be described [24]as D RNA [21]. Then the effect of interaction of that transcript with the cloud would yield where C is the cloud interaction term for molecule RNA exhibiting probabilities of RNA-RNA interactions as a function of the STIC represented by matrix W k at some position r in the cell.
For RNA expression data from the whole cell, r is ignored. In experiments from purified nuclei, r ranges from 0 to the radius of the cell's nucleus, r N . In experiments derived from the cytosol, r ranges from r N to the cellular membrane radius r C . Experimental data from extracellular vesicles will have r > r C and RNA half-life becomes important to consider as a factor. As a first approximation for the C function, we assume that the deviation from ideal D RNA scales as the number of reverse complement words (rcCount) in common with transcriptome W k and is measured by difference to the number tCount of words in common with W k , which normalizes for transcript size. We could also compare to ranCount, number of words in common with a randomized W k . Putting together, we have to normalize number of words, or alternately, where α is a scaling factor and is dimensionless. The reverse complement measure (RCM), which factors rcCount word frequencies by rcFreq, then subtracting the count of words in common with transcriptome (tCount) by the corresponding tFreq, is one of several possible measures for correlation to measured compartmentalization of individual transcripts from RNA-expression datasets. The content of W k (r) changes as a function of r due to changing concentrations of transcripts in the cell. Boundary condition on whole cell measurement from microarray or RNA-seq experiments would be assuming no export from the cell. If there are no reverse complement words in common between transcript S and W k , then C [W k , S] is zero and the diffusion of that molecule is ideal. As a first approximation for the C function, we assume that the deviation from ideal D RNA scales as the number (rcCount) of reverse complement words in common with W k and is compared by a difference to the number tCount of words in common with W k to normalize for transcript size. We could also compare to ranCount, number of words in common with a randomized W k . Reverse complement measure (RCM) factors word frequencies to assess transcript-cloud interactions that correlate to measured compartmentalization of individual transcripts [17].

Words that are solvent-accessible
The above model treatment assumed that all RNA sequences are available for reverse complementarity interactions. RNAs except for miRNAs typically have regions that are solventinaccessible and/or double-stranded, preventing intramolecular interactions [22]. mRNAs have more secondary structures or intra-strand base pairing than expected by chance [23]. We have determined the secondary structure of all RefSeq transcripts to predict single-stranded regions using RNAfold [24], while others have used RNA structure predictors (RNAplfold in Refs. [25,26]) in a pooling predictor using machine learning [27]. Additionally, nucleotide solvent-accessibility in RNA structures could be estimated by the neural network method of Singh [22] using models of window size 3 nt, which could be expanded to 5-9 nt windows for k length. Alternatively, accessible surface area can be calculated by a publically available program NACCESS [28] to refine the STIC transcriptome words to those populated from singlestranded regions only, along with confidence measures. Solvent accessibility estimates for each transcript word partition the frequency entries in the transcriptome matrix W k (S, f A , f I , r) by reducing f A in the amount that f I increases. Shifts of f A to f I could be caused by RNA-binding factors (RNA or protein) that cover a word in the transcript or the word in the transcriptome, or indirectly by binding to some other region of the RNA causing a cis-type of structural alteration leading to solvent inaccessibility ( Figure 2). Transcriptome cloud or nebula regulation is introduced here and is proposed to occur as an indirect result of some factor that changes f A /f I for some word that then alters a different interacting transcript's diffusion coefficient. Conversely, f I to f A shifts could be caused by the release of binding factors or conformational change leading to exposure of the particular word and to nearby target RNAs. Dictionaries with this dynamic accounting of the transcriptome are labeled T, instead of the simpler W word matrix.

Genome-wide profiling of in vivo RNA structure
Recent transcriptome-wide RNA structure profiling through application of structure-probing enzymes or chemicals combined with high-throughput sequencing promises in vivo RNA structural information availability. Resultant datasets provide opportunity to investigate RNA structural information on global scale for STIC development. The analysis of high-throughput RNA structure profiling data requires considerable computational effort and currently dataset are not readily available. StructureFold processes and performs analysis of raw high-throughput RNA structure profiling data [29], incorporating wet-bench structural information from chemical probes and ribonucleases to restrain RNA structure prediction via RNAstructure and ViennaRNA package algorithms. StructureFold is deployed via the Galaxy platform. Alternatively, structure-seq is a recent quantitative and high-throughput method that provides genome-wide information on RNA structure with single-nucleotide resolution [30]. The methodology can perform both in vitro and in vivo RNA structure-function determinations, with insights to RNA regulation of gene expression and RNA processing. Implementation of structure-seq begins with chemical RNA structure probing under single-hit kinetics conditions. Modified RNA is then subjected to reverse transcription using random hexamer primers, then reverse transcription executed until it is blocked by chemically modified residues. Resultant cDNAs are amplified by adapter-based polymerase chain reaction (PCR) which are subjected to high-throughput sequencing, subsequently allowing retrieval of structural information on a genome-wide scale. A single structure-seq experiment can provide information on tens of thousands of RNA structures in a matter of weeks. Ding et al. [31] used RNAstructure calculated for each of thousands of mRNAs a positive predictive value (PPV), which they use to compare relative frequencies of base pairing in vivo constrained RNA structures to in silico predicted RNA structure. They found that most mRNAs do not fold in vivo as to in silicopredicted structures, as evident from a broad PPV distribution. Interestingly, mRNAs of cold and metal ion stress-response genes folded in vivo significantly different from their unconstrained in silico predictions. These stressors are known to affect RNA structure and thermo-stability like melting temperature T m . Instead, genes involved with basic biological functions such as gene expression, protein maturation or processing, and peptide metabolic processes show little change in their in vivo-constrained and in silico-predicted RNA secondary structures. Ding speculated mRNAs related to cell maintenance and showing high PPV may have evolved to resist large conformational changes in order to maintain homeostasis, an idea suggesting RNA resilience. This bias may be detectable in our transcriptome model W. As genome-wide profiling of in vivo RNA structure datasets becomes easily available [32], the information can be incorporated into the STIC model by adding custom transform and load tools for each dataset.

Simplest models of transcriptomes
We consider simplest models of the transcriptome to examine some limits on the model parameters and functions. Component of a transcriptome are listed in Table 1.

Transcriptome model 0 (W 0 )
Simplest model considers a spatially uniform transcriptome composed of ribosomes, tRNA, and a minor number of mRNAs. Here, the distribution of transcripts is assumed uniform and the transcriptome model lacks spatial elements. Further, consider a transcriptome where ribosomes and tRNAs are all "A" for adenosine, and the mRNAs are also all "A." We can construct transcriptome model W 0 using transcript values found in Table 1 from Seffens [17], with sizes n and respective abundances. The number of nonunique words (of size k) would be nk þ 1 for each transcript. Using sizes and respective abundance values in Table 1 [ The next more realistic model is composed of eight real human RNA transcripts comprising a simple representation of the transcriptome in a cell ( from four of the most prevalent human tRNAs with lengths of n ¼ 71-73 nt, and four of the major subunits of the eukaryotic ribosome with sizes from n ¼ 121 to 5034 nt, with the total number of nucleotides N being the sum of the nucleotides in each transcript, or N ¼ 7470 nt. Then the frequency of words with length k that are contained in each transcript is a subset of the number of possible k-mer words which is nk þ 1. In Model-1 labeled as W 1 , for each word length from k ¼ 3 to 22, word count was calculated along with the sum of the frequencies of those words extracted from the simple eight RNA transcripts. The intermediate output from program TIC-generator (for transcriptome information cloud generator, described in Ref. [17]) listed all k-mer words contained in each transcript, together with their frequency of occurrence. These lists from the eight rRNA and tRNA transcripts were combined, and then duplicate words resolved to form dictionary W k 1 . With the total possible number of words of length k ¼ 4 k , the fraction of all the words actually present in W k 1 decreased for increasing word size [17].
It is interesting that the peak in unique and total duplicate (blue diamonds in Ref. [17]) words is maximal at the same size as the miRNA "seed" sequence as defined in Ref. [18].

Randomized transcriptome of Model-1 (W 1-R )
We ran TIC-generator with shuffled-sequence transcripts labeled Model 1-R. Base composition of Model-1 transcriptome is 1341 "A," 2320 "C," 2519 "G," and 1291 "T," or 18% "A," 31% "C," 34% "G," and 17% "T." Using a random letter generator, we assembled four random transcriptomes with the same transcript length for the eight sequences and equal Model-1 base composition. We examined mostly word lengths k of 7 and 8 in preliminary studies shown below.

Real model validation
As a validation of this transcriptome model framework, we utilized the simple transcriptome model version (simple model W 1 ) that used real highly expressed genes, and for comparison separately, randomized sequences of that transcriptome (W 1-R ). This simple realistic model is composed of only eight real human RNA transcripts as a basic representation of the transcriptome in a cell. Experimental validation of the basic model transcriptome for k-mers considered various trial functions of semantic word similarity and reverse complementarity, which were calculated using published data sets. For example, trial functions evaluated include tWord for transcriptome words in common with target multiplied by respective word frequency in W k . A total of seven RNA studies, with data sources grouped into high and low study parameter sets, were statistically analyzed by mean values and t-test calculated as two-tail t-test under two-sample equal variance assumption models ( Table 2). Validation for the STIC model examined various functions of reverse complementarity using these published data sets. Here, we assume that appearance in exosomes or microparticles requires greater mobility and hence larger diffusion coefficients than cytoplasmic or nuclear RNAs [17]. Several functions tested include tWord for transcriptome words in common with target multiplied by word frequency in the transcriptome, rcWord (reverse complement k-mer words in common times frequency), RCM ¼ rcWord -tWord, reverse complement count (RCC) measure ¼ tCount -rcCount, Z-RCC as a zscore of RCC compared to four randomized transcriptomes Model 1-R, Z-RCM as the z-score of RCM, RCC-Ran which subtracts the value computed from 1-R and finally (RCC-Ran)/Len which is normalized for sequence length. The first five studies examined miRNA, while the Chen [33] and Friedel [34] studies measured mRNAs. Description of data sources that were grouped into high and low study parameter sets, with mean values and t-tests calculated detailed in sections below.

Model 1 validation with miRNA from exosome datasets
The Villarroya-Beltri [35] work reports on microarray datasets of exosome and cellular fractions from activated and resting human T lymphocyte cultures. They differentially assessed whether RNAs are specifically enriched within exosomes by performing microarray analysis of activation-induced variations in mRNA and miRNA profiles from primary T lymphoblast and their secreted exosomes. Data found in their supplementary data and also data publicly available at gene expression omnibus as Gene Expression Omnibus (GEO) Series accession number GSE50972 were used for Table 2. They showed that for most cases, miRNAs modulated upon activation are differentially found in cells and exosomes for either upregulated or downregulated miRNAs. This suggests that mRNA and miRNA loading into exosomes is not a simple passive process. Specific miRNAs were more highly expressed in exosomes than found in the cells, and in most cases this difference is preserved under cellular resting or activated conditions. Similarly, most miRNAs that are preferentially found in cells than in exosomes also keep this tendency regardless of the activation state of the cell. As such, Villarroya-Beltri classified some miRNAs as specifically sorted into exosomes (labeled EXOmiRNAs), whereas others are specifically retained in cells (as CLmiRNAs). We calculated tCount and rcCount as a count ( Table 2 in Ref. [17]), and tWord and rcWord, the latter which factor the expression level of that word. Other measures compared counts and words to a randomized transcriptome (RAN). We used a word size k ¼ 7 roughly equal to the miRNA seed sequence length [17]. Values of rcWord (mean 10.31) were lower than tWord (mean 12.45), and hence RCM and RCC were more negative for exosomes compared to cytoplasmic miRNAs. This supports the STIC model since exosome transcripts must diffuse further than cytoplasmic (CL) RNA, so avoid reverse complementarity. In summary, all trial measures calculated from this dataset showed significant support for the transcriptome model except for Z-RCM.

Model 1 validation with nuclear-enriched miRNAs
Park et al. [36] study compared microarray analysis of cytoplasmic and in this case nuclear fractions of hct116 colon cancer cells. They identified various miRNAs that exist in isolated nuclei from miRNA profiles correlated between cytoplasmic and nuclear fractions from multiple microarray analyses. Nuclear confinement of the mature form of miRNAs was validated by controlling reverse transcriptase RT-PCR conditions excluding the presence of precipitate forms of miRNA (e.g., as pri-miRNA or pre-miRNA). They found that elevated levels of representative miRNAs in purified nuclei support the idea that significant numbers of mature miRNAs survive not only in the cytoplasm but also in the nucleus. We sorted their data by N/C ratio and *partitioned these data into two groups: N/C > 0.47, which was nuclear-enriched (45 samples), and N/C < 0.47, which was preferentially found in the cytoplasm (33 samples). We found that tCount was 4.02 for nuclear-enriched, and 5.00 for cytoplasmic, with a t-test p-value of 0.116 between the groups; while tWord was 4.73 for nuclear and 10.58 for cytoplasmic miRNAs, with a significant t-test p-value of 0.023 between nuclear and cytoplasmic groups. We also found nuclear-enriched miRNAs have higher rcWord values compared to cytoplasmic miRNA (p-value ¼ 0.021 in Table 2), suggesting those transcripts have greater potential to interact with other transcriptome RNAs and hence may have lower than expected diffusion coefficients. The other evaluated measures did not show significance between groups.

Model 1 validation with additional RNA studies
Huang et al. [37]  Notes: Double-asterisk cells have significance below 0.05, while single-asterisk cells have significance below 0.10 but above 0.05. Cells with "nc" were not calculated from randomized transcriptome.  Table 2 were significant measures for data sets in that study.
Pursuing in-depth understanding of the mechanism supporting selective exportation of miRNAs to extracellular vesicles (EVs), Guduric-Fuchs [39] employed next generation sequencing to discriminate global expression patterns of small RNAs in HEK293T cells and the EVs that they released. Enrichment of overexpressed miRNA in EVs was measured by RT-qPCR in HEK293T cells, mesenchymal stem cells, macrophages, and immune cells. We sorted data from Guduric-Fuchs by EV/cell ratio, then compared the top 10 (exosome-enriched) and bottom (cytoplasmic enriched) miRNAs by evaluating the measures listed in Table 2. Only trial functions Z-RCC and RCC-RAN were significant from this dataset. Overall from using EV/cell in various measures examined across the studies, tWord and tCount (from Ref. [17]), along with their difference (tW-tC), have values that progress from lower for nuclear, higher for cytoplasmic, and highest for exosomal miRNAs. Therefore, we consider under transitivity, EXO > CL > NUC for these transcriptome measures of similarity. This supports the notion that miRNAs with sequence similarity to the overall transcriptome can random-walk furthest from their points of transcription if the secretion mechanism requires a great distance to travel. These conclusions on trial functions are most significant with the tCount measure, with a pvalue close to zero for the Villarroya-Beltri study, and 0.016 for the Guduric-Fuchs study, while the Park study showed little difference (p-value ¼ 0.122) for tCount between nuclear and cytoplasmic enrichment.

Word count normalization from RNA-seq datasets
Normalization is a crucial step in the analysis of RNA-seq data and has a strong impact on the detection of differentially expressed genes sought to validate the STIC model. Several normalization strategies have been proposed to correct for between-sample distributional differences in read counts, such as differences in total counts (i.e., sequencing depths), and within-sample gene-specific effects, such as gene length or GC-content effects [40]. Global-scaling normalization adjusts gene-level counts by a single factor per sample, such as the per-sample total read count, or reads per kilobase of exon model per million mapped reads (RPKM), or some housekeeping gene count. Statistical corrections by a quantile per-sample count distribution or other robust summaries obtained by relating each sample to a reference sample (e.g., trimmed mean of M values (TMM) and methods of Anders and Huber [41]). Although there have been efforts to systematically compare normalization methods [42], this important aspect of RNA-seq analysis is still not fully resolved. When data arise from complex experiments as in Section 2 above, involving cell fractionation, low-input RNA or different batches and read lengths, there may be more to correct for than differences in sequencing depth, referred to as unknown nuisance technical variation error. One methodology correction is the addition of spike-in controls within the normalization procedure [43]. Control designs have been successfully employed in microarray normalization, for miRNA and mRNA arrays [44]. Negative controls in the normalization procedure test the assumption that the majority of genes are not differentially expressed between study conditions. This assumption can be violated when a global shift in expression occurs between conditions, such that control-based normalization may be necessary for technical variation, and a global mean read for global differences in RNA levels.

Spatial and temporal localization
We follow with a description of possible experimental data sets for populating transcriptome model in W. RNA-seq data sets would be the preferred source for fine structure of word contents, but microarray expression data could also be used for overall population of W.

Spatial localization by RNA imaging
The only method that provides insight into both the level and localization in single cells is in situ hybridization (ISH), which has increased considerably in importance in RNA research. ISH along with multiplex RNA profiling (MERFISH) can be used to measure the degree of associations among transcripts. Numerous RNA species have been identified, counted, and localized in single cells using MERFISH, a single-molecule imaging approach that uses combinatorial labeling and sequential imaging with an encoding scheme capable of detection and/or correction of errors. This multiplexed measurement of individual RNAs can be used to measure the gene expression profile and noise, along with covariation in expression among different genes, and spatial distribution of RNAs within single cells.

Localization of small RNAs
For miRNAs, ISH is exceptionally challenging because of miRNA features such as small size, sequence similarity among various miRNA family members, and low tissue-specific or development-specific expression levels. Standard ISH protocols can be modified to improve miRNA detection [45]. Locked nucleic acid (LNA/DNA) probes have great utility in miRNA detection because of short hybridization time, high efficiency, discriminatory power, and high melting temperature of the miRNA/probe complex [46]. Minimal length of LNA/DNA probes was found to be 12 nt with probes usually containing 30% LNA nucleotides [46]. A mixture of 2 0 -OMe RNA and LNA modifications in a 2:1 ratio resulted in improved specificity and stability of the probe/RNA duplex in comparison to LNA/DNA probes [47]. Experiment specificity was found to be further improved by lengthening the probe length to 19 nt [48].

Localization by MERFISH
Chen et al. [33] used array-synthesized oligopools as templates to make encoding probes in the MERFISH protocol. An oligopaint approach developed by Beliveau et al. [49] can generate a large number of oligonucleotide probes to label chromosome DNA. Inspired by this approach, Chen et al. [33] designed a two-step labeling scheme to encode and read out cellular RNAs. They labeled a target set of cellular RNAs with a set of encoding probes, each probe comprising a RNA targeting sequence and two flanking readout sequences. Four readout sequences were assigned to each target RNA species based on error-correction optimized code words.
They identified these readout sequences with complementary FISH probes via rounds of hybridization and imaging; each round using a different readout probe. To increase the signal-to-background ratio, each cellular RNA is labeled with $192 encoding probes.

RNA diffusion
Brownian effects are ubiquitous in numerous examples of soft condensed matter physics [20] in which the system can be modeled as a set of interacting degrees of freedom in contact with a heat reservoir. Brownian motion plays an important role when one infers macroscopic behaviors from mesoscopic levels of description, frequently a desire in the study of complex systems. Dynamics at the mesoscopic level is governed by a set of Langevin processes or equivalently by the corresponding N-particle Fokker-Planck equation. This scheme applies nonequilibrium thermodynamics to derive the kinetic equations describing the evolution of an N-particle probability distribution function [20]. One then considers a system of N Brownian particles diluted in a solvent, which acts as a thermal reservoir. Particle velocities are then modeled as internal thermodynamic variables and permit an analysis in the phase space of the Brownian particles. A local equilibrium hypothesis constrains the phase space level and from it one derives the thermodynamic entropy balance equation. Entropy production accounts for irreversible processes taking place in the phase space, then quantifying fluxes and forces can be done in a similar manner as in the thermodynamics of irreversible processes [20]. A general thermodynamic treatment of systems of N interacting Brownian motion particles as described by Fokker-Planck equations is detailed by Savel'ev et al. [16].

Resilience as a systems biology measure from transcriptome model
Development of a resilience measure from transcriptome RNAs could improve basic knowledge of the transcriptome and responses to stress. Transcriptome size and overall variation have been documented across cell cycle stages, tissue types, developmental stages, diurnal cycles, sexes, and environment [50]. Despite the ubiquity of transcriptome size variation, its potential to introduce systematic bias into expression profiling has been largely overlooked and this study uncovers responses of the transcriptome to stress.

Formalization of metric for resilience in biological systems using STIC metrics
Insight into structural determinants of robustness and resilience can guide the understanding of systems that go through transitions. Systems engineering research has developed methodologies to measure the functionality and complexity of engineered systems for designing and assessing system resilience. While system functions, resilience, functionality, and complexity are widely used concepts in systems engineering, there is significant diversity in definitions and no unified approach to measurement in the systems biology area [51]. One method for measuring impacts on functionality in dynamic engineered systems is based on changes in kinetic energy [52]. This metric can be applied at particular levels of abstraction and system scales, consistent with the established multiscale nature of biological systems.

Measuring complexity
A difficulty in complexity theory is the lack of a clear definition for complexity, particularly one that is measurable [53]. Underlying cause for this lack of a unified complexity definition is that there are numerous conceptual types of complexity. The first formal treatment of complexity focused on algorithmic complexity, which reflects the computation requirements for a mathematical process [54]. Senge [55] and Sterman [56] expand the scope of definition to include dynamic complexity, which is primarily characterized by difficult-to-discern and hard-to-measure cause-effect relations. A recent workable definition is that of thermodynamic depth, which essentially asserts that complexity is a "measure of how hard it is to put something together" [57]. Several variations on this approach share the commonality that complexity should disappear for both ordered and purely stochastic systems [58]. Additionally, Bar-Yam [59] defined complexity as the length of the shortest string that can represent the properties of a physical system. This string could be the result of measurements and observations over time.
An energy-based metric was proposed by Chaisson [60] measuring the energy rate density, where Φm is energy rate density, E is energy flow through a system, τ is the time frame, and m is system mass. Chaisson obtains results that correlate well with other notions of complexity, and below we add our proposed relation from this transcriptome model framework A practical difficulty in using the Φm metric is determining the appropriate mass and energy.
In measuring the Φm of a transcriptome, we can use the mass of RNA production and the total energy processed by the system. Energy in this framework could be the total sum of all possible RNA-RNA interactions, which is just the count of all NAM and RCM in W as a sum of overall transcript sequences S. However, the total energy of a transcriptome does not flow just through its cell, but also exported to the extracellular space and captured from that external source of transcripts, the mass of which is difficult to measure.
While higher functionality can be associated with increased resiliency and robustness, the concepts are not synonymous. As defined by the INCOSE Resilient Systems Working Group, "Resilience is the capability of a system with specific characteristics before, during, and after a disruption to absorb the disruption, recover to an acceptable level of performance, and sustain that level for an acceptable period of time" [61]. Robustness is the ability of a system to reject disturbances without altering its state. A system is robust when it can continue functioning in the presence of internal and external challenges without fundamental changes to the original system. In relation to previous section on energy availability, robustness is the ability for a system to retain reachable states in the event of falling available energy.

Framework for measuring resilience
Instead, complexity in the presented framework can be derived from properties of W or T as in Figure 3. Consider a transcriptome from a cell type alpha to be represented as T α , such that it is the sum of all RNAs, including mRNA, miRNA, lncRNA, and rRNA within the cell ( Table 1). This set is the result of transcripts produced from the cellular DNA, T α 0 , transcripts captured from the extracellular space (EC) in the form of microparticles and exosomes T EC IN , and depletion as microparticle or exosome export to the extracellular space with T α OUT . Or, where F is a filter function with parameters S (transcript sequence), RC (reverse complement of transcript sequence), n sequence length of S, and "a" is a fitting parameter with suitable dimensions, derived from: F α NAM/(RCM * n) proportionality. Thus the extracellular pool is composed of transcripts with greater similarity S, and less reverse complementarity RC to the transcriptome of origin and also have smaller size n. The filter functions f S (S) and f RC (S,RC) operate on sequences S and RC, and essentially is a semantic selection filter on transcripts by affecting diffusion. We propose that resilience of the cell is proportional to size of the transcriptome filter F, then resilience α |F|, where |F| ¼ |f S | þ |f RC |, or normalized for transcriptome size, such that |f S | is sum of all similarity matches, |f RC | is sum of all reverse complement interactions, and N is the total nucleotide size of the transcriptome. . Framework for deriving transcriptome interactions and resilience. Data source is from RNA expression experiments using RNA-seq or microarray values, or randomized sets for controls. From input sets, the aligned gene ID and frequency of the extracted words are populated into a dictionary. Gene ID is used to calculate solvent-accessible (A) and inaccessible (I) word probabilities from full length transcripts in silico. The dictionary can be queried for any sequence S to find probability distribution of S in the dictionary. Changes in the transcriptome will change the distribution due to changes in A/I for affected words. Overall metrics of the dictionary measure resilience using Eq. (8) in the text.

Discussion
The concept of resilience is receiving increasing attention in chronic stress-related diseases. Resilience has been shown in clinical studies to play a protective role in patients with chronic conditions including osteoarthritis, breast and ovarian cancer, diabetes, and cardiovascular disease related to psychosocial dimensional levels. The purpose of this study is to explore the relationships between RNA-RNA interactions and to devise a measure of resilience at the cellular level.

Prospects, challenges, and limitations for resilience measure by variance in RNA-seq
Although research on empirical indicators of robustness and resilience is rudimentary, there is already a fast-growing body of engineering modeling as well as empirical work in ecology. Nonetheless, major challenges remain in developing robust procedures for assessment of the transcriptome. A goal of systems biology is to analyze large-scale multidomain networks to reveal relationships between network structures and their biological function. While generally, it is not feasible to visualize and understand whole networks, a common analysis is to partition the network into subnetworks responsible for specific biological functions. Since biological functions can be carried out by particular groups of molecules, dividing networks into naturally grouped clusters can help investigate the relationships between function and topology of system networks or reveal hidden knowledge behind them. The expression in Eq. (8) for resilience is a measure of the size of network interactions possible within a transcriptome.

Notion of the transcriptome as an information system
The body of this work considers the transcriptome as an information system modeling a dynamic system. A dynamic system is characterized by two concerns: the static structure and dynamic behavior. The structural elements of dynamic systems are those elements which may be identified from static snapshots of the problem space; while dynamic aspects involve those semantic elements of the system that exist over the time domain. While modeling the static aspects of an information system like RNA expression data, an understanding of the dynamic nature of information systems in the cell is low. Behavioral issues of large information systems are usually complex, consisting of many interactive sessions with the outside environment, tasks like coordination and collaboration among different entities. Dynamic systems can exhibit emergent properties that result from the dynamics, and which cannot be attributed to static structural factors. However, given any real world information system consisting of many multistream interactive processes, emergent properties are usually complex, without a common characteristic structure. Such emergent properties are beginning to be addressed with the transcriptome.

Conclusion
We show that the transcriptome can be modeled as an information system with emergent dynamic properties. The term nebula regulation is introduced to consider the regulatory effects of the whole transcriptome acting locally through RNA-RNA interactions and shifts between accessible and inaccessible stretches of RNA sequence. Described as a network of interactions from semantic analysis of similarity and reverse complementarity, together with the size of a transcript, affect the diffusion of transcripts in a cell, and hence the distribution of RNAs. A measure to represent resilience is proposed as the sum of the component elements (similarity, reverse complementarity, and normalized by total nucleotides) of this transcriptome filter.