Distribution of number of CDS per transcript.
A novel technique is used to study combinatorial and statistical properties of transcriptome sequences. The approach is based on the distribution of nucleotide triplet frequency dictionaries obtained from the conversion of transcriptome sequences. The distribution is revealed through PCA presentation and the elastic map technique. The transcriptomic data of Siberian larch (Larix sibirica Ledeb.) and Siberian pine (Pinus sibirica Du Tour) were studied. The transcriptomes exhibit unusual symmetries: an octahedral structure exhibiting rotational symmetry in the transcriptome contig distribution was found for L. sibirica, while mirror symmetry was found for P. sibirica. The octahedral structure seems to be universal for plants.
- Chargaff’s parity
- mirror symmetry
- rotational symmetry
The discovery of order and new structures in genetic entities is a topical scientific problem. Indeed, the amount of primary genomic data grows daily by billions of megabases. These data are symbol sequences over a four-letter alphabet (with few variations in some nucleotide sequences; say, substitutions in RNAs).
We studied an order and structuredness over a set of sequences representing the transcriptome of Siberian larch (
There are two approaches to discuss structuredness in a set of symbol sequences (transcriptome nucleotide sequences, in our case). The first implies that one seeks inhomogeneities in the mutual distribution of the sequences from the ensemble under consideration. Of course, to do so, one must introduce a metric to measure the difference between any two sequences; there are various ways to do it [2, 3, 4]. An alignment might be such a measure [5, 6] (see also a more prominent approach presented in [7, 8]). Alternatively, the second approach implies the search for inhomogeneities within a sequence, e.g., through the comparison of formally identified fragments of a sequence.
Regardless of the specific approach to seeking structuredness, one must introduce a way to measure the difference between the objects to be analyzed. Alignment [9, 10, 11] is the most widespread approach here. An alternative idea for finding structure and order in symbol sequences is to transform them into frequency dictionaries [12, 13, 14, 15]. A frequency dictionary could be defined in various ways, but basically it is a list of all the strings of a given length, each accompanied by its frequency (a detailed description is given below). A transformation of a symbol sequence into a frequency dictionary provides a mapping of a set of sequences into a metric space. Hence, one may apply all the tools of analysis available in such a space.
As soon as a structure in an ensemble of sequences, or over a sequence, is defined, the question arises of the properties of that structure. Symmetry is probably the most fundamental of these properties. Again, there could be various notions of symmetry. The first concept of symmetry aims to identify structures that remain similar when some simple transformations are applied in a proper space. First of all, a rotational symmetry of a cluster structure [3, 4] or a mirror symmetry [16, 17] must be mentioned here.
A few words should be said about symmetry. Here we consider two notions of it. The first is the well-known rotational, mirror, or similar symmetry observed in the distribution of the contigs converted into triplet frequency dictionaries, as they are distributed in the relevant Euclidean space (where the triplets are the coordinates). The second is measured through the proximity (or deviation) to Chargaff’s parity rules, to be observed for various entities, both natural (these are contigs) and artificial (kernels, or arithmetic means of the frequency of identical triplets counted over an ensemble of contigs).
2. Material and methods
2.1 Transcriptome nucleotide sequence data
The transcriptomes of Siberian larch and Siberian pine were originally sequenced under the project on the whole genome sequencing of Siberian larch [18, 19]. The sequence data of
2.1.1 L. sibirica bud transcriptome
For the purposes of our study, we have selected the bud transcriptome of
The transcriptome comprises 12,353 transcripts in total. The histograms of the distribution of the transcriptome sequence entries over their length are presented in Figure 1; the distribution closely resembles a Poisson distribution. There are 7573 transcripts bearing a single CDS (possibly in either direction). Another 4038 transcripts have two or more CDS; the distribution of the number of CDS in transcripts is shown in Table 1. Finally, no CDS has been found in 742 transcripts.
2.1.2 P. sibirica bud transcriptome
We used bud transcriptome from
There are as many as 426 transcripts with no CDS detected in them. Surprisingly, there are no transcripts in the transcriptome with CDS belonging to both strands, simultaneously. The distribution of number of CDS found in a transcript is shown in Table 1. On the contrary to
2.2 Triplet frequency dictionary
The triplet frequency dictionary $W_{(3,t)}$ is the list of all 64 triplets $\omega$ found within a sequence under consideration, where each entry (triplet) is assigned the frequency $f_{\omega}$ of the triplet $\omega$. The reading frame step $t$ could be chosen arbitrarily and depends on the specific problem to be solved. Hereafter we use $t = 1$ or $t = 3$; the index $t$ is omitted from the notation unless this causes confusion.
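The definition above can be illustrated with a minimal sketch. The function name `triplet_dictionary` and the parameters `step` and `start` are hypothetical names chosen for this illustration; skipping windows with ambiguous symbols (such as N) is an assumption, not a rule stated in the text.

```python
from itertools import product

ALPHABET = "ACGT"
TRIPLETS = ["".join(p) for p in product(ALPHABET, repeat=3)]  # all 64 triplets


def triplet_dictionary(seq, step=3, start=0):
    """Return {triplet: frequency} for windows of length 3 read with the given step."""
    counts = dict.fromkeys(TRIPLETS, 0)
    seq = seq.upper()
    for i in range(start, len(seq) - 2, step):
        word = seq[i:i + 3]
        if word in counts:  # assumption: windows with ambiguous symbols are skipped
            counts[word] += 1
    total = sum(counts.values())
    if total == 0:
        return {w: 0.0 for w in TRIPLETS}
    return {w: c / total for w, c in counts.items()}
```

With `step=1` every position starts a window, while `step=3` reads non-overlapping triplets from the chosen start position.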
A frequency dictionary unambiguously maps a sequence into a point in 64-dimensional metric space. Strictly speaking, a dictionary with $t = 3$ maps a subsequence into the point of the metric space, not the entire sequence; we shall discuss this point in more detail below. Next, the dimension of the space is 63, not 64; this fact follows from the linear constraint:
$$\sum_{\omega} f_{\omega} = 1, \qquad (1)$$
where $f_{\omega}$ is the frequency of the triplet $\omega$ and the sum runs over all 64 triplets.
This constraint allows one to exclude any triplet from the analysis, thus replacing the 64-dimensional space with a 63-dimensional one, where all variables are linearly independent.
Formally speaking, any triplet could be excluded. In practice, one should eliminate the triplet with the smallest standard deviation determined over the set of frequencies under consideration. Indeed, suppose a triplet $\omega$ yields a standard deviation equal to zero, as determined over a set of dictionaries; this means that all dictionaries in the set have the same frequency for this triplet: $f^{(j)}_{\omega} = \mathrm{const}$ for all $j$ (here $j$ enumerates the dictionaries in the set). Such invariance makes the dictionaries (and the sequences standing behind them) indistinguishable from the point of view of this triplet. Choosing the triplet with the minimal standard deviation for exclusion eliminates the variable that contributes least to the distinguishability of the entities.
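The exclusion rule can be sketched as follows. This is an illustration only: the function name `drop_least_variable` and the representation of dictionaries as Python `dict` objects are assumptions, and the population standard deviation is used here although the text does not say which estimator was applied.

```python
from statistics import pstdev


def drop_least_variable(dictionaries, triplets):
    """Remove the triplet whose frequency varies least over the set of dictionaries."""
    # Standard deviation of each triplet frequency over all dictionaries.
    sigma = {w: pstdev([d[w] for d in dictionaries]) for w in triplets}
    excluded = min(sigma, key=sigma.get)
    kept = [w for w in triplets if w != excluded]
    # Each dictionary becomes a vector with one coordinate fewer.
    vectors = [[d[w] for w in kept] for d in dictionaries]
    return excluded, kept, vectors
```

Applied to the full set of 64 triplets, the returned vectors have 63 coordinates each.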
2.2.1 Metric choice
The list of triplets accompanied with the frequency of each entry makes frequency dictionary ; let , at the moment. Hence, a dictionary is a point in metric space; obviously, one may define metrics in a number of ways, in such space. For the purposes of further analysis, we use the Euclidean metrics:
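$$\rho\left(W^{(1)}, W^{(2)}\right) = \sqrt{\sum_{\omega} \left(f^{(1)}_{\omega} - f^{(2)}_{\omega}\right)^{2}}. \qquad (2)$$
Other metrics might be used as well. Here the superscripts $(1)$ and $(2)$ index two different dictionaries (and the corresponding sequences).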
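A minimal sketch of the Euclidean distance between two frequency dictionaries is given below; the function name `euclidean_distance` is hypothetical, and a triplet absent from a dictionary is assumed to have zero frequency.

```python
import math


def euclidean_distance(d1, d2):
    """Euclidean distance between two frequency dictionaries given as {word: frequency}."""
    words = set(d1) | set(d2)
    # A word missing from a dictionary is treated as having zero frequency.
    return math.sqrt(sum((d1.get(w, 0.0) - d2.get(w, 0.0)) ** 2 for w in words))
```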
2.3 Chargaff’s imparity index
To begin with, we recall the well-known complementarity pattern established by E. Chargaff in 1952 [21, 22]; it consists in a strict equality of the numbers of A’s and T’s (and of C’s and G’s, respectively) counted over a DNA molecule. Of course, some minor violations may take place due to mutations; meanwhile the accuracy of this equality is very high. This fact is also known as Chargaff’s first parity rule.
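The second Chargaff’s parity rule stipulates that
$$n_A \approx n_T, \qquad n_C \approx n_G \qquad (3)$$
if the numbers $n_A$, $n_T$, $n_C$, and $n_G$ of the corresponding nucleotides are counted within a single strand. The accuracy of (3) is rather high but varies for different taxa.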
Surprisingly, relations similar to (3) are observed for oligonucleotides counted over a single strand. Let us now introduce some rigorous definitions and notions.
Definition 1. Let a string $\omega = \nu_1 \nu_2 \ldots \nu_q$ be an oligonucleotide of length $q$, where $\nu_j$ is the nucleotide occupying the $j$-th position.
Definition 2. Two strings $\omega$ and $\overline{\omega}$ make a complementary palindrome if they read the same in opposite directions, with respect to Chargaff’s complementarity rule:
Hence, , . Here are some examples of complementary palindromes:
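A few complementary palindromes can be generated with the short sketch below. The function name `complementary_palindrome` is hypothetical; the pairing itself is the standard reverse complement over the alphabet A, C, G, T.

```python
# Chargaff's complementarity: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complementary_palindrome(word):
    """Complement each nucleotide and read the result in the opposite direction."""
    return word.upper().translate(COMPLEMENT)[::-1]


for w in ("ACG", "AAT", "GGCA", "ACGT"):
    print(w, "<->", complementary_palindrome(w))
```

Some strings, such as ACGT, coincide with their own complementary palindrome.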
So, the generalized second Chargaff’s rule stipulates equality (or proximity, to be exact) of frequencies of two strings comprising complementary palindrome [23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33]. Surely, one hardly could expect to get the absolute equality of the frequencies of any two strings comprising complementary palindrome. There is a number of reasons standing behind the violation of such absolute equality; they range from purely combinatorial [25, 26, 27, 34] and/or finite sampling effect to biological peculiarities [24, 28, 30, 33].
To reveal the difference between genetic entities or biological objects, one must introduce a measure of the violation of the generalized second Chargaff’s rule; one may do it in various ways; we use the discrepancy index:
Here is the set of strings of the length observed in two sequences (and , respectively), enlists all the strings, and is the string complementary palindromic to . Normalization factor is introduced to equalize the figures (4) observed for various .
The index (4) measures the discrepancy between two dictionaries (and ). Meanwhile, this index could be applied for a single frequency dictionary :
Here the complementary palindromic couples are combined from the strings belonging to the same frequency dictionary .
The discrepancy measure (4) looks like a Euclidean distance, but it is not. More exactly, it could be considered as a metric in Euclidean space; to do so, one must replace one point of the pair with its dual, that is, the one composed of the complementary palindromes.
The inner discrepancy measure (5) definitely is not a distance, since it characterizes a single object, not a couple.
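The two discrepancy measures can be sketched as below. This is an illustration under stated assumptions, not the authors' formula: the measure is taken as the root of the summed squared differences between the frequency of each string in one dictionary and the frequency of its complementary palindrome in the other, and the normalization by $4^q$ (the number of possible strings of length $q$) is an assumption made here because the exact normalization factor is not recoverable from the text. The function names are hypothetical.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def partner(word):
    """Complementary palindrome of a word."""
    return word.translate(COMPLEMENT)[::-1]


def discrepancy(d1, d2=None):
    """Discrepancy between d1 and d2; with d2 omitted, the inner discrepancy of d1."""
    if d2 is None:
        d2 = d1
    words = set(d1) | {partner(w) for w in d2}
    if not words:
        return 0.0
    q = len(next(iter(words)))
    total = sum((d1.get(w, 0.0) - d2.get(partner(w), 0.0)) ** 2 for w in words)
    return math.sqrt(total / 4 ** q)  # ASSUMED normalization, see the note above
```

A dictionary in which every string has the same frequency as its complementary palindrome yields zero inner discrepancy.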
2.4 $W_{(3,1)}$ and $W_{(3,3)}$ dictionaries
It is well known that a genome comprises coding and noncoding regions. Basically, they differ in the statistical properties manifested in triplet frequency dictionaries. One might detect some minor difference in the composition of $W_{(3,1)}$ dictionaries developed for coding vs. noncoding regions. A significantly greater difference between these two types of genome parts is observed for $W_{(3,3)}$ dictionaries [2, 3, 4].
The dictionary $W_{(3,1)}$ is uniquely defined for any sequence. The situation differs for $W_{(3,3)}$ dictionaries. Consider a sequence of length $N$. Covering the sequence with frames of length 3 that move along the sequence with step 3, one may get three different dictionaries, depending on the location of the start point. The start may be located at the first, the second, or the third nucleotide of the sequence; thus, three different triplet frequency dictionaries could be obtained.
The key difference between coding and noncoding regions consists in the deviations between these three dictionaries. In other words, let the sequence fall entirely into a noncoding region of a genome. One may develop three triplet frequency dictionaries $W^{(0)}_{(3,3)}$, $W^{(1)}_{(3,3)}$, and $W^{(2)}_{(3,3)}$, corresponding to three positions of the reading frame shift (these are 0, 1, and 2). The key issue is that these three dictionaries:
Differ significantly if developed for coding and noncoding regions.
Differ from each other, if developed for a coding region.
Differ from each other negligibly, if developed for a noncoding region.
In other words, consider a set , developed over a noncoding region and a set , developed over a coding region. Then, the difference between is rather small, when expressed in any way (as Euclidean distance, entropy, mutual entropy, etc.; see also [7, 8]), but the difference between is significantly greater. Besides, the difference between and manifests apparently. These deviations in statistical properties of such triplet frequency stand behind the
We shall explore structuredness in transcriptomes through the analysis of those triplet dictionaries developed over the individual transcripts.
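The comparison of the three frame-shifted dictionaries can be sketched as follows. The function names `frame_dictionary`, `distance`, and `frame_spread` are hypothetical, and the Euclidean distance is only one of the possible ways to express the difference mentioned in the text.

```python
import math
from collections import Counter


def frame_dictionary(seq, shift):
    """Frequencies of non-overlapping triplets read from the given start position."""
    words = [seq[i:i + 3] for i in range(shift, len(seq) - 2, 3)]
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def distance(d1, d2):
    """Euclidean distance; a missing triplet counts as zero frequency."""
    words = set(d1) | set(d2)
    return math.sqrt(sum((d1.get(w, 0.0) - d2.get(w, 0.0)) ** 2 for w in words))


def frame_spread(seq):
    """Pairwise distances between the dictionaries built for shifts 0, 1 and 2."""
    frames = [frame_dictionary(seq.upper(), s) for s in range(3)]
    return {(a, b): distance(frames[a], frames[b])
            for a in range(3) for b in range(a + 1, 3)}
```

Large pairwise distances are what the text expects for a coding fragment, and small ones for a noncoding fragment; the threshold separating the two cases is not specified in the text.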
2.5 Relative phase
To reveal the inner structuredness of a (bacterial) genome, Gorban and coauthors have introduced special construction that might be called
A subsequence identified by a specific tile is then converted into frequency dictionary , and the inner structuredness of a genome is represented through the distribution of the points corresponding to tiles, in 64-dimensional (or 63-dimensional) metric space.
This structuredness is basically determined by the so-called
Fall completely into a coding region.
Fall completely outside a coding region.
Contain a border between coding and noncoding regions.
In any case, the relative phase indicates whether the start of a tile coincides with the start of a coding region or not. The following combinations determine the relative phase index:
Start of a coding region coincides with the start of a tile. In this case the relative phase is 0.
Start of a coding region does not coincide with the start of a tile, and the remainder of the division by 3 of the distance (expressed in number of nucleotides) between the start of the tile and the start of the coding region is 1. Then the relative phase is 1.
Finally, the start of a coding region falling inside the tile does not coincide with the start of the tile, and the remainder is 2. Then the relative phase is 2.
For any tile covering a noncoding region, , by definition.
It should be stressed that genes (or coding regions) may take place in opposite strands; in such capacity, the relative phase index must be defined for leading strand and lagging one, separately, where the remainder of the division must be determined for the difference between the last symbol of a tile and the last nucleotide of a gene annotated in a sequence as located in the lagging strand. Thus, seven figures of the relative phase index are possible: , , and for the tiles containing coding regions from the leading strand; , , and for the tiles containing coding regions from the lagging strand; and, finally, labeling the tiles covering noncoding regions, only.
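The rules above can be sketched in a few lines. This is an illustration under assumptions: coordinates are taken as 0-based and inclusive, the strand is encoded as "+" (leading) or "-" (lagging), and the function name `relative_phase` is hypothetical. Tiles without any coding region are not handled here and would need a separate label.

```python
def relative_phase(tile_start, tile_end, cds_start, cds_end, strand="+"):
    """Return the remainder (0, 1 or 2) that defines the relative phase of a tile.

    Leading strand: offset between the start of the tile and the start of the CDS.
    Lagging strand: offset between the last symbol of the tile and the last
    nucleotide of the CDS.
    """
    if strand == "+":
        return (cds_start - tile_start) % 3
    return (tile_end - cds_end) % 3


# A CDS starting 4 nucleotides after the tile start gives remainder 1.
print(relative_phase(0, 602, 4, 300))
```

Python's `%` operator returns a non-negative remainder, so the function also gives a consistent phase when the coding region starts before the tile.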
2.5.1 Transcriptome relative phase
The situation is slightly different for transcriptome (and the transcriptomes of
Each frequency dictionary corresponding to a specific transcript was labeled with relative phase index; the labeling procedure was pretty close to that one described above, with few exceptions. We used
The relative phase index for transcripts containing a single CDS was determined in completely the same way, as described above. The transcripts bearing no CDS, if any, have been labeled with index . Finally, the problem arose from the transcripts bearing several CDS: obviously, a relative phase index is defined ambiguously for such transcripts. In such capacity, we labeled the transcripts with multiple CDS with special figure of the relative phase index.
Finally, we have calculated the standard deviation for each triplet, over the entire set of transcripts; that is with , so we excluded this triplet from the set of variables to cluster the transcripts. Reciprocally, the triplet with yields the maximal figure of the standard deviation.
Similar figures determined for
Previously, seven-cluster symmetric patterns have been reported in bacterial genomes [2, 3, 4]. Later, similar (but not equivalent) structures were found in chloroplast genomes [16, 17]. First of all, the tiles corresponding to a specific relative phase tend to aggregate into clusters clearly seen in the projection onto the three principal components with the largest eigenvalues. The points corresponding to a specific strand (either the leading or the lagging one) form a triangle in the frequency space; the points corresponding to noncoding regions tend to gather into a ball-like structure located in the central part of the pattern.
The patterns described in [2, 3, 4, 16, 17] are provided by the interplay of two triangles and the central ball. The triangles comprise the points corresponding to specific strand. There are two basic symmetries found in these triangles: the former is a shift (rotational) symmetry peculiar for bacterial genomes [2, 3, 4], and the latter is mirror symmetry peculiar for chloroplasts [16, 17]. The ball comprise the points corresponding to the tiles with noncoding regions inside (chloroplast genomes have one more cluster called
Whether a pattern has four or seven clusters depends on the GC content of a genome, for bacteria [2, 3, 4]. This figure almost completely determines the mutual location of the planes comprising the triangles formed by the clusters belonging to the same strand. There are some exceptions to this rule, for cyanobacteria. Chloroplasts exhibit mirror symmetry in the strand-specific triangles, so they always have a four-beam structure, where the triangles occupy the same plane with obligatory coincidence of a pair of phases [16, 17].
3.1 Phase index coloring agreement
To make the presentation of results clearer, let us fix the color and label mark usage for transcripts to be shown in figures everywhere further. Indeed, we should distinguish eight different phases in the figures: , , , , , ,
To do that, we shall use the following labels: all phases of through of transcripts from the leading strand are marked with triangles; all phases of through of transcripts from the lagging strand are marked with diamonds;
Besides, the relative phases of single CDS transcripts are colored in the following manner: is purple triangle, is lime triangle, and is yellow triangle; reciprocally, is magenta diamond, is azure diamond, and is sand diamond.
A few words should be said about the distribution of the transcripts with several CDS detected in them. For both transcriptomes, the distribution of such transcripts in the 63-dimensional space seems to be very homogeneous; in other words, these transcripts do not form any specific cluster, nor are they attracted to any of the clusters formed by the transcripts with a specific (and unambiguous) relative phase index. We discuss this point in more detail later; here we note only that the points representing such multi-CDS transcripts are omitted from the figures illustrating the results.
Thus, the clusters formed by transcripts of the same relative phase index are located in two parallel planes (in the space of three principal components with the largest eigenvalues). This observation holds true for
L. sibirica transcriptome octahedron
Unlike the tiles developed for a genome, the transcripts of a transcriptome exhibit an ultimate pattern, that is, octahedron. The rectangular triangles, and , in Figure 2 occupy the position in two orthogonal planes. Note, these triangles do not comprise the clusters from the same strand; on the contrary, phases over the octahedron are distributed in the manner shown in Figure 2 (right).
Figure 3 shows the distribution of
The transcriptome shown in this figure exhibits a clear and unambiguous octahedral pattern in cluster location. It is evident that the phases from the leading strand lie in a plane; likewise, the phases from the lagging strand also lie in a plane, and these two planes are parallel. It should be stressed that this pattern is observed in the metric space defined by the eigenvectors of the covariance matrix; in other words, the clear octahedral pattern is observed in an affinely transformed space, not in the original one determined by triplet frequencies.
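The projection onto the eigenvectors of the covariance matrix can be sketched with NumPy as below. This is a generic principal component projection offered as an illustration; the function name is hypothetical, and it does not reproduce the elastic map software or any preprocessing the authors may have applied.

```python
import numpy as np


def project_on_principal_components(points, n_components=3):
    """Project row vectors onto the covariance eigenvectors with the largest eigenvalues."""
    x = np.asarray(points, dtype=float)
    centered = x - x.mean(axis=0)
    cov = np.cov(centered, rowvar=False)             # covariance of the coordinates
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigenvalues)[::-1][:n_components]
    return centered @ eigenvectors[:, order], eigenvalues[order]
```

For the data discussed here, `points` would be the set of 63-dimensional frequency vectors, one row per transcript.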
Let us now consider the distribution of the points corresponding to
Also, this figure shows the distribution of the transcripts where no CDS have been found (brown circles). The cluster comprising these transcripts is rather remarkable: in terms of clustering in the 63-dimensional triplet frequency space, the transcripts where no CDS have been found behave very much like the fragments falling completely into noncoding regions of a genome, when a complete genome is sliced into a set of tiles [2, 3, 4, 16, 17]. This observation provides indirect (yet rather strong) evidence for the total lack of any CDS in such sequences; otherwise, the corresponding frequency dictionaries could never gather in a ball centered at the pattern.
The transcripts with several CDS inside are distributed over the pattern almost homogeneously, including the central spot where the transcripts without CDS are concentrated. Apparently, this fact follows from the multiplicity of CDS in these transcripts: an interplay of different CDS located within a transcript may yield an effective value of its
P. sibirica transcriptome octahedron
Let us now focus on the peculiarities of the transcriptome of
To begin with, Figure 5 shows the clustering pattern observed for this transcriptome; the technology of the development of the pattern is absolutely the same, as in Figures 3 and 4. The strongest difference between this transcriptome and the
At first glance, the pattern shown in Figure 5 looks like a tetrahedron, but it is not. In a proper projection, the pattern looks like a hexagon; adding the subset of multi-CDS transcripts, one gets the same pattern almost homogeneously covered by the points corresponding to that subset.
The patterns provided by the distribution of relatively short fragments of a genome may tell a researcher a lot [2, 3, 4, 16, 17]. For bacteria, GC content seems to be the key factor determining the details of the pattern [2, 3, 4]. That is not so for chloroplasts, mitochondria, and cyanobacteria [16, 17]. The results presented above show that GC content has nothing to do with the pattern observed over a transcriptome. Hence, a question arises about the key factor determining the specific type of a pattern. So far, there is no simple and brief answer, while Chargaff’s parity rule discrepancy may be quite informative here.
We have determined Chargaff’s rule discrepancy measure (5) figure for all six clusters observed in
Let us now focus on a few more details on Chargaff’s imparity index, itself. The index value differs for different length
Calculating the index (4) over a single strand, one may clearly understand to what extend a strand looks like the opposite one, in terms of the word frequency [23, 24, 25]. For random non-correlated sequence with and . Hence, figures remain the same, if the discrepancy
Unlike figures, the radii of these six clusters exhibit quite diverse behavior. The radius of a cluster is an average distance from the center (that is arithmetic mean) determined over the cluster to each point from the cluster. Lower part of Table 1 shows the radii figures. The radii figures are apparently different, for the transcriptomes under consideration. and phases for
The inter-cluster discrepancy measure is of great interest for both cases; Table 3 shows these indexes. Careful examination of Table 3 allows one to identify three pairs of relative phase indexes with a distinctly lower value of (4), namely, the pairs:
Evidently, the phases in these couples yield two different types of symmetry: the first one is shift, and the second symmetry is mirror. The situation is opposite for
To make the situation with symmetries clear, we show the clusters over the elastic map shown in the so-called
Such mirror symmetry has been previously reported for chloroplast genomes [16, 17] (see also [23, 37, 38]); yet, there were no other but the chloroplast genomes exhibiting such mirror symmetry, and
Definitely, the coincidence of these two symmetrical patterns does not mean that
The most amazing thing in transcriptome statistical properties is that it yields an octahedral pattern, unlike bacteria, organelle, and other genetic entities (say, yeast genomes). Another point is that the pattern does not depend on the length of transcripts taken into consideration: we have examined separately the subsets of transcripts as long as bp, bp, and those longer 3000 bp. All these subsets yield similar pattern, with very minor variation mainly manifesting in cluster density.
Total absence of the (rather extended) noncoding regions.
Elimination of introns from the statistical analysis of sequences.
Of course, the first item from this list is quite arguable: a number of transcripts where no CDS has been detected bring a direct and unambiguous disproof of it. Thus, the question arises, whether these transcripts are similar, in some sense, to the fragments of genome comprising purely noncoding regions of the latter.
We have examined the first hypothesis through the simulation of noncoding regions. To do that, we have added a number of frequency dictionaries obtained from the tiles covering the noncoding parts of genomes of several other organisms. All the tiles were as long as 603 bp and contained noncoding regions, exclusively. The number of dictionaries (the points, in other words) varied from one third to one half of the total number of transcripts in the set. By assumption, this addition simulated a genome.
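Cutting a sequence into tiles of fixed length can be sketched as follows. The tile length of 603 bp is taken from the text; the step between consecutive tiles is an assumption (non-overlapping tiles by default), and the function name `tiles` is hypothetical. Selecting only the tiles that fall entirely into noncoding regions requires a genome annotation and is not shown.

```python
def tiles(seq, length=603, step=603):
    """Yield (start, fragment) pairs for windows of fixed length along a sequence."""
    # The step is an ASSUMPTION; the text states only the tile length.
    for start in range(0, len(seq) - length + 1, step):
        yield start, seq[start:start + length]
```

A length divisible by 3, such as 603, keeps the reading frame of non-overlapping triplets aligned with the tile start.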
Upon addition, we expected to see a pattern similar to that one observed in bacteria, organelle, or other eukaryotic organisms; the octahedron pattern appeared to be stronger. Figure 8 obviously disproves this hypothesis: it shows the same transcriptome (
The impact of introns on the alteration of the observed pattern is less evident. Moreover, one faces greater difficulties in revealing it. One might want to compare the distributions developed over and dictionaries, in this case; yet, this problem needs careful investigation and falls beyond the scope of this paper.
Systematic comparison of (rather short) fragments of fixed length formally identified within a genome reveals a symmetry in the distribution of the triplet frequency dictionaries obtained over those fragments; originally this effect was found in bacterial genomes. Later, similar behavior (though different in a number of essential details) was found for chloroplast and mitochondrial genomes. The general pattern of the distribution looks like a superposition of two triangles where the vertices correspond to the fragments of the same relative phase. In simple words, it corresponds to a reading frame shift, in the case of a translation-like processing of a DNA sequence.
A transcriptome itself might be considered as a set of those fragments, with few exclusions. Firstly, the lengths of transcripts are different and may affect the expected pattern. Secondly, there are no fragments in a transcriptome corresponding to those obtained from noncoding (intergenic) regions of a genome. This fact results in ultimate possible configuration of the clusters corresponding to the transcripts with the same relative phase index, that is, octahedron. All these patterns could be seen in the space of three principal components with the largest eigenvalues. The
The data used in this study were obtained under the grant 14.Y26.31.0004 from the Russian Government. The authors also thank Serafima Novikova from Siberian Federal University for the helpful discussion.
Conflict of interest
The authors declare no conflict of interest.