Classification of graphical representation methods.
Sequence comparison is one of the most fundamental tasks in bioinformatics. For biological sequence comparison, alignment is the most profitable method when the sequence lengths are not so large. However, as the time complexity of the alignment is the square order of the sequence length, the alignment requires a large amount of computational time for comparison of sequences of large size. Therefore, so-called alignment-free sequence comparison methods are needed for comparison between such as whole genome sequences in practical time. In this chapter, we reviewed the graphical representation of biological sequences, which is one of the major alignment-free sequence comparison methods. The notable effects of weighting during the course of the graphical representation introduced first by the author and co-workers were also mentioned.
- amino acid sequence
- binary image
- DNA sequence
Comparison between biological sequences is one of the most fundamental tasks in the area of bioinformatics. For relatively short sequences, such as nucleotide sequences of genes or amino acid sequences of proteins,
Since the seminal paper by Hamori and Ruskin  was published, various kinds of sequence comparison methods based on the graphical representation have been proposed by many researchers. The basic procedure of the graphical representation is outlined as follows: first, each character in a biological sequence, which is expressed by the four-letter alphabet for nucleotide sequences and the 20-letter alphabet for amino acid sequences, is expressed by individual vectors in a certain dimensional space; next, the vectors are connected successively in a head-to-tail fashion, drawing a curve, or a
In this chapter, we briefly review the graphical representation methods for biological sequence comparison. In addition, we introduce our work recently published, in which weighting during the course of the graphical representation shows the notable effects.
2. Variations of graphical representations
The graphical representation methods are classified into some classes according to the target sequences and the dimension of the representation space. The target sequences of the graphical representation are amino acid sequences of proteins and nucleotide sequences of DNA (or RNA), including specific genes, mitochondrial genomes, and others. Table 1 summarizes the classification of the graphical representation methods published so far.
|Target sequence||Dimension of expression space||Work|
|Specific genes||2D||[4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]|
|3D||[23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36]|
|Mitochondrial genomes||2D||[37, 38, 39, 40, 41]|
|2D||[44, 45, 46, 47, 48, 49]|
|3D||[50, 51, 52, 53]|
2.1. Graphical representation of DNA sequences
Biological sequences stored in data archives are expressed by the four-letter alphabet for nucleotide sequences of DNA and the 20-letter alphabet for amino acid sequences of proteins. To represent the biological sequences by graphs, it is necessary to express each character composing the sequences in numerical form.
The most popular strategy for the numerical expression is assigning vectors to respective characters in the alphabet. As for nucleotide sequences of DNA, the individual vectors of two, three, or higher dimension are assigned to four types of bases, A, T, G, and C.
2.1.1. Two-dimensional representation
Figure 1 is the two-dimensional vector assignment utilized by Gates . Although many variations of the assignments are given according to the layout of the four bases, the number of the independent assignments is reduced to 3!/2 = 3, when the assignments that are transformed to each other by rotation on the
By connecting the vectors successively in a head-to-tail fashion according to each base appearing in a nucleotide sequence, a graphical representation is generated. Figure 2 shows, as an example, the graphical representation of sequence “TGAGTTC” generated by Gates’ assignment.
The assignment of Figure 1 may draw circuits in the graphical representation, leading to the loss of information that the original biological sequence has. To get rid of the degeneracy, Yau  introduced the assignment shown in Figure 3, which makes no circuit in the graphical representation; because the
Some researchers used another approach; they directly mapped bases on the
2.1.2. Three-dimensional representation
Hamori and Ruskin  used a three-dimensional vector assignment to bases (Figure 6). Gates’ approach (Figure 1)  is a simplified version of this assignment. However, unlike Gates’ approach, Hamori’s assignment does not make any circuit in the resultant curve (called
Zhang and Zhang  used another three-dimensional vector assignment shown in Figure 7. The resultant curve, called
2.1.3. Higher than three dimensions
The graphical representations in the space of higher than three dimensions cannot be visualized directly. Instead of direct visualization, they are expressed abstractly or projected on some spaces of lower dimensions. The approaches of this type (including the variations with some modifications) are utilized in Refs. [25, 30, 28].
2.2. Graphical representation of proteins
A general strategy for graphical representation of protein sequences is common to that for DNA sequences, namely, numerical expression of characters followed by mapping on certain dimensional spaces, except for the fact that the number of character types is 20 instead of 4 for DNA sequences. A detailed review of graphical representation of protein sequences is given by Randić et al. . Here, we briefly mention the variations of the graphical representation scheme of proteins.
Figure 8(a) and (b) presents two-dimensional vector assignments to 20 amino acids utilized by Randić et al.  and Wen and Zhang , respectively. In Randić’s assignment, the 20 amino acids (indicated by three-letter codes) are arranged uniformly on a unit circle in alphabetical order. On the other hand, in Wen’s assignment, the horizontal and the vertical coordinates of the vectors are given by
He et al.  extended Randić’s vector assignment (Figure 8(a)) to three dimensions by adding one extra coordinate corresponding to the position of the amino acid in the original sequence, with the modification of the arrangement of the 20 amino acids on the unit circle based on the 6-bit binary gray code assigned by the codon structure of the amino acids.
3. Numerical characterization of graphical representations
As well as the visual evaluation of the similarities between biological sequences through their graphical representations, the quantitative estimation of the similarities also can be done by the numerical characterization of the graphs. The general method of the quantitative estimation is to construct
For the numerical characterization, there are two kinds of methods: geometrical methods and graph-theoretical ones .
3.1. Geometrical characterization
The most simple method of the geometrical characterization was proposed by Raychaudhury and Nandy , in which the graphs are numerically characterized by their geometrical centers. Let be the coordinate of the
A more accomplished geometrical characterization was proposed by Liao et al. , in which they constructed a two-component feature vector based on the 2×2 covariance matrix
The two-component vector is given by the two eigenvalues of
The approach proposed by Qi et al.  is another example of the geometrical characterization. They constructed an eight-component feature vector from the averages of the
3.2. Graph-theoretical characterization
The graph-theoretical characterization that is most widely used is the method based on the D/D (distance/distance) matrix . The off-diagonal (
There are two variations of the D/D matrix. If the denominator (the graph-theoretical distance) is replaced by the sum of the geometrical lengths of the edges between the two vertices, the D/D matrix is denoted as the L/L matrix; if the denominator is replaced by the total number of the edges between the two vertices, the D/D matrix is denoted as the M/M matrix.
The feature vectors are constructed from the leading eigenvalues of the D/D matrix, which are the invariants of the matrix and can well describe the characteristics of the individual graphs. For example, Randić et al.  used 12-component vectors given by the first leading eigenvalues of the L/L matrices calculated from the 12 essentially different patterns of the graphical representations, and Liao and Wang  used three-component vectors constructed by the similar manner.
The similarity/dissimilarity between the sequences, A and B, is measured by the Euclidean distance between the end points of the corresponding feature vectors by
or the cosine of the angle between the feature vectors by
where and are the
4. Graphical representation based on binary images
The author and co-workers recently published the paper about a novel two-dimensional graphical representation of DNA sequences based on binary images . In this section, we introduce our method and demonstrate the notable effects of
4.1. Vector assignment
We used the two-dimensional vector assignment to four bases shown in Figure 9, which is a modified version of Gates’ assignment (Figure 1). We located both G and C on the same side so that the GC-contents of the target sequences can be represented on the graphs; the graphs for the sequences with high GC-contents tend to grow in the downward direction, although the tendency is not rigid due to the weighting mentioned below.
4.2. Introducing weighting
In order to extract potential information conveyed by individual bases in DNA sequences, we introduced
The conditional probability is calculated from the appearance frequencies of triplets of bases. For example, the probability that base A occurs after a pair of bases TC is given by
Table 3 lists the weighting factors calculated with base 4 of the logarithm in Eq. (5) from 31 mammalian mitochondrial genomes. The weighting factor lower than 1.00 indicates that the pair of bases on the row tends to be followed by the base on the column, and on the other hand, the weighting factor higher than 1.00 indicates that, after the pair of bases on the row, the base on the column is hard to appear.
|Preceding pair of bases||Third base|
Let us illustrate the procedure of the graphical representation with weighting factors by a simple example. Figure 10(a) and (b) shows the graphical representations of sequence “ACATATG” by Kobori’s vector assignment (Figure 9) without and with weighting, respectively. The weighting is not applied to the first two bases, because the weighting factors are not given for the first two bases by our weighting scheme. The weighting factors for the subsequent bases A, T, A, T, and G are 0.83, 0.90, 0.84, 0.92, and 1.42, respectively (see Table 3). The vectors for the bases are multiplied by the corresponding weighting factors. As a result, the graphical representation in Figure 10(a) is modified as shown in Figure 10(b).
We demonstrate the notable effects of the weighting on the graphical representations by the real sequences. Figure 11 depicts the graphical representations of three mammalian mitochondrial genomes without weighting and with weighting. Comparing the graphs with weighting (lower row) to the graphs without weighting (upper row), it can be recognized that the characteristics of the graphs are emphasized by the weighting and the individual species can be easily distinguished.
4.3. Generating binary images
A binary image is defined as a digitized image composed of the pixels with two possible values (e.g., 0 and 1), which are typically assigned by
4.4. Numerical characterization by local pattern histograms
In this work, each graph is characterized by the frequency distributions of
4.5. Distance measures between local pattern histograms
There are several measures to estimate similarity/dissimilarity between two histograms. Here, we briefly mention five frequently used methods. In the following formulas,
4.5.1. Histogram intersection
Histogram intersection was proposed by Swain et al.  for color indexing of images, which is defined as
It ranges from 0 to 1, with 1 for
4.5.2. Manhattan distance
Manhattan distance, also known as
which ranges from 0 to 2, with 0 for
4.5.3. Bhattacharyya distance
Bhattacharyya distance  is defined between two probability distributions from a divergence
which ranges from 0 to 1, with 1 for
4.5.4. Jensen-Shannon divergence
where and is the Kullback-Leibler divergence calculated by
Here, . Note that the local patterns having are excluded from the calculation. The Jensen-Shannon divergence ranges from 0 to 1, with 0 for
4.5.5. Kendall’s rank correlation coefficient
Kendall’s rank correlation coefficient , also known as Kendall’s
so that ranges from 0 to 1, with 0 for the rank orders of
4.6. Reconstruction of phylogenetic tree
Among the five distance measures mentioned above, histogram intersection and Manhattan distance showed the best performance. Figure 13 shows the phylogenetic tree of 31 mammalian species reconstructed by our method using Unweighted Pair Group Method with Arithmetic mean (UPGMA) with the histogram intersection distance measure. The same tree is given by Manhattan distance.
With the rapid growth of the data size in the archives of biological sequences, the demand for the alignment-free sequence comparison methods is increasing. Graphical representation is one of the major alignment-free sequence comparison methods. In addition to the visual discrimination abilities of the sequences, the graphical representation has an advantage of requiring only small computational time. The similarity/dissimilarity between a pair of sequences is calculated from the feature vectors constructed based on the graphical representation. The time complexity of the calculation is estimated to be