The ratios of individual amino acids to the total amino acids deduced from a complete genome, and those of individual nucleotides to the total nucleotides in the genome, are useful indexes for characterizing large genomes across species ranging from bacteria to Homo sapiens. These indexes are independent of both species and genome size. Using these indexes, the following results were obtained: (1) primitive life forms appear to have had amino acid compositions similar to those of present-day organisms; (2) cellular amino acid compositions are similar among various species and between whole cells and complete genomes; (3) the genome is constructed homogeneously from putative small units encoding proteins of similar amino acid composition, and mutations occur synchronously over the genome; (4) cluster analyses using amino acid contents as traits classify all organisms into two groups, “GC-rich” and “AT-rich,” based on their nucleotide contents, and classify vertebrates into “terrestrial” and “aquatic” groups, reflecting natural selection; and (5) evolution based on alterations in nucleotide content can be expressed by definitive equations. Thus, the ratios of amino acids or nucleotides to their total contents are useful indexes for characterizing genomes, regardless of species differences and genome sizes. The relationship between any two normalized nucleotide contents is universally expressed by a regression line.
- Chargaff’s parity rules
- cluster analysis
- phylogenetic trees
The origin of life has fascinated humans since ancient times. Indeed, Aristotle proposed “spontaneous generation” more than 2000 years ago, although this idea was disproved by Louis Pasteur in experiments using “swan neck flasks.” Our great interest in the origin of life might be expressed by the following philosophical words:
The development of nucleotide sequencing technology [1, 2] has contributed to progress in molecular biology, including the analysis of a complete bacterial genome, first carried out in 1995, and, subsequently, the draft human genome, reported in 2001 [4, 5]. As of June 19, 2019, 498 eukaryotic, 5159 bacterial, and 296 archaeal complete genomes had been determined. However, the origin of life is still unclear. Phylogenetic trees have been drawn on the assumption that the replacement rates of nucleotides or amino acids in genes are constant [6, 7, 8, 9, 10, 11]. However, the actual replacement rates are known to differ between genes and between species. Studies based on nucleotide or amino acid sequences are applicable to genes, whose nucleotide or amino acid numbers are much smaller than those of complete genomes, but not to genomes consisting of huge numbers of nucleotides and many genes. Of course, simple comparison of sequence differences between genes in the same species, or between the same genes in different species, is useful.
Intraspecies nucleotide contents were first analyzed in 1950 by Chargaff, who reported that G = C, A = T, and (G + A) = (C + T); this was named Chargaff’s first parity rule. This rule is readily understood from the double-stranded structure of DNA. Additionally, the rule applies to a single DNA strand obtained from the nucleus of a single species, which is termed Chargaff’s second parity rule. As the rules are based on values normalized to 1 (G + C + A + T = 1), nucleotide contents are expressed as ratios. However, the second parity rule is more difficult to understand because it is hard to imagine how G and C or T and A pairs could form within a single DNA strand. Recently, this puzzle has been solved mathematically, using the similarity of the forward and reverse strands and the homogeneity of the DNA strand over the genome structure. Although Chargaff’s parity rules originally described intraspecies phenomena, they can be expanded to inter-species phenomena using data from a large number of complete genomes; the second parity rule, however, is applicable only to a single DNA strand derived from a double-stranded DNA molecule.
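The normalization and the two parity rules can be illustrated with a short Python sketch. The function names and the toy strand are assumptions made here for illustration and are not taken from the cited studies; a real analysis would read a complete genome strand from a sequence file.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def nucleotide_contents(strand):
    """Return the G, C, A, and T contents of one strand, normalized to sum to 1."""
    counts = Counter(base for base in strand.upper() if base in COMPLEMENT)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no A/C/G/T bases found")
    return {base: counts[base] / total for base in "GCAT"}

def reverse_complement(strand):
    """Return the complementary strand, read in the opposite direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))

def duplex_contents(strand):
    """Contents of the double-stranded molecule (strand plus its complement)."""
    return nucleotide_contents(strand + reverse_complement(strand))

def parity_deviation(contents):
    """Return (|G - C|, |A - T|); both are 0 when a parity rule holds exactly."""
    return (abs(contents["G"] - contents["C"]),
            abs(contents["A"] - contents["T"]))

toy_strand = "ATGGCGTACGTTAGCCGATAAGCT"  # hypothetical toy strand, not real data
single = nucleotide_contents(toy_strand)
duplex = duplex_contents(toy_strand)

print("single strand deviation:", parity_deviation(single))
# First parity rule: both deviations are 0 for the duplex by construction.
print("duplex deviation:", parity_deviation(duplex))
```

For the duplex the deviations vanish by base pairing alone, whereas for a single strand they are small only when the strand is long and homogeneous, which is the content of the second parity rule.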
Sueoka was the first to analyze the cellular amino acid composition of bacteria, and our laboratory has independently analyzed the cellular amino acid compositions of bacteria, archaea, and eukaryotes. Graphical representation, or a diagrammatic approach to the study of complicated biological systems, can provide an intuitive picture and useful insights [19, 20]. Using certain graphical presentations, huge data sets from genomes can be easily recognized as simple patterns representing complicated organisms. Indeed, when a radar chart is used to express cellular amino acid compositions, the resulting “star-shaped” patterns are similar among various organisms, and their differences seem to reflect biological evolution. In addition, the amino acid compositions deduced from complete genomes resemble those obtained from amino acid analyses of cell lysates. These results suggest that the ratios of amino acids to the total amino acids and those of nucleotides to the total nucleotide content are useful indices for characterizing whole genome structures.
3. Patternalization of amino acid compositions
In general, 20 amino acids can form proteins, and amino acid sequences are strictly controlled by 64 codons, each consisting of three nucleotides (a triplet). Thus, differences in the amino acid sequences of the same kind of protein reflect biological evolution among species, whereas differences among different kinds of proteins do not seem to be significant. Furthermore, sequence comparisons of protein mixtures are theoretically too complex to consider with currently available tools. Conversely, the amino acid composition predicted from one or more proteins can characterize them from a different point of view, not only within the same organism but also among different organisms. In fact, the cellular amino acid compositions of various bacteria have been analyzed. Because 20 amino acids comprise proteins, there were 20 traits to evaluate, which, at first glance, seemed too many to provide meaningful information about cells. However, when a radar chart was used to present the amino acid compositions, the data could be patternalized, and the amino acid composition was observed to represent certain cellular characteristics, as shown in Figure 1. The patterns of bacteria (
4. Chronological precedence of protein formation over codon formation
To understand the establishment of primitive organisms, the chronological order of protein and codon formation is a very important subject in biological evolution. Unfortunately, this order has not yet been established, because primitive organisms were formed under many unknown conditions an extremely long time ago. However, a simulation analysis based on the random choice of amino acids or nucleotides was carried out, assuming that polymerization depended on the free monomer concentrations, in accordance with the chemical reaction rules that govern natural phenomena. Amino acid polymerization, without codons, produced proteins that reflected the original free amino acid concentrations, whereas nucleotide polymerization did not produce functional proteins, even when the codon table was taken into account, as shown in Figure 2. Therefore, it seems difficult to support “the RNA world,” which presumes that RNA polymers formed primitive life forms. Additionally, the likelihood that RNA, which absorbs UV light at around 260 nm, could accumulate under the strong UV irradiation present on the primitive Earth might be very low. These results suggest that protein formation might have chronologically preceded codon formation at the end of prebiotic evolution, although we have no explanation of how the nucleotide sequence information necessary for proteins might have been transmitted to the nucleotide polymerization that established the codons. The “amino acid world” seems a better fit for primitive life forms than the “RNA world.” There are several hypotheses for codon formation [27, 28, 29], but the process of codon formation has not yet been determined.
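The amino acid half of such a simulation, in which each monomer is chosen at random with a probability proportional to its free concentration, can be sketched in Python as follows. The pool concentrations are invented placeholder values, not those used in the cited simulation, and the random seed is fixed only to make the run reproducible.

```python
import random
from collections import Counter

def polymerize(pool, length, rng):
    """Build a polymer by choosing each monomer with a probability
    proportional to its free concentration in `pool`."""
    monomers = list(pool)
    weights = [pool[m] for m in monomers]
    return rng.choices(monomers, weights=weights, k=length)

def composition(polymer):
    """Return the ratio of each monomer to the total number of monomers."""
    counts = Counter(polymer)
    total = len(polymer)
    return {m: counts[m] / total for m in counts}

# Hypothetical free amino acid concentrations (placeholder values summing to 1).
pool = {"Gly": 0.30, "Ala": 0.25, "Val": 0.15,
        "Leu": 0.15, "Ile": 0.10, "Trp": 0.05}

rng = random.Random(0)
polymer = polymerize(pool, 5000, rng)
observed = composition(polymer)

# Largest gap between the polymer composition and the free pool.
max_dev = max(abs(observed.get(m, 0.0) - pool[m]) for m in pool)
print("largest deviation from the pool:", round(max_dev, 4))
```

With a few thousand residues the polymer composition stays close to the pool composition, which is the sense in which randomly polymerized proteins reflect the free amino acid concentrations. The sketch does not address whether such polymers are functional.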
According to our simulation analyses, the proteins that were components of primitive life forms might reflect the free amino acid concentrations on the primitive Earth. As shown in Figure 1, the basic cellular amino acid composition, the “star-shape,” is characterized by comparatively high contents of hydrophobic amino acids, such as valine, leucine, and isoleucine. The glycine and alanine contents were also comparatively high. The hydrophobic amino acids might have contributed to the self-aggregation of proteins to form primitive life forms at low protein concentrations, and the high glycine and alanine contents might reflect their easy formation on the primitive Earth. In fact, simple amino acids such as glycine and alanine have been identified in meteorites [30, 31] and can be formed by electrical discharge in an atmosphere presumed to resemble that of the primitive Earth. Conversely, the contents of phenylalanine, tryptophan, and tyrosine, which absorb ultraviolet light, were quite low. Strong ultraviolet irradiation might have induced photodegradation of these amino acids. The differences in amino acid contents in cellular amino acid compositions seem to reflect the presumed free amino acid concentrations on the primitive Earth, which eventually resulted in the formation of the “star-shaped” cellular amino acid compositions (Figure 1).
5. Amino acid compositions deduced from complete genomes
Initially, amino acid compositions were deduced from complete genomes by assuming that each gene is equally expressed in a whole cell. The amino acid composition deduced in this way from the complete genome resembled the cellular amino acid composition obtained from amino acid analyses of cell lysates, as shown in Figure 3. Because the two values have different origins, this coincidence is difficult to understand until the genome structure is clarified, as described in the next section.
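A minimal Python sketch of this calculation is given below, under the stated assumption that each gene is expressed equally, so that every predicted protein is counted once. The function name and the three short sequences are hypothetical; a real calculation would use all proteins predicted from a complete genome.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 amino acids

def genome_amino_acid_composition(proteins):
    """Pool the residues of all predicted proteins, counting each protein once
    (equal expression assumed), and return the ratio of each of the 20 amino
    acids to the total number of residues."""
    counts = Counter()
    for sequence in proteins:
        counts.update(r for r in sequence.upper() if r in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no amino acid residues found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}

# Hypothetical toy proteins, for illustration only.
proteins = ["MKVLAAGIV", "MSTLLGAVA", "MGLVKAE"]
comp = genome_amino_acid_composition(proteins)

for aa in AMINO_ACIDS:
    print(aa, round(comp[aa], 3))
```

The resulting 20 ratios are the traits that can be drawn as a radar chart or compared with values measured from cell lysates.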
6. Homogeneity of genome structure
Each gene has its characteristic amino acid or nucleotide sequence, and its amino acid or nucleotide composition differs not only between species but also within a species. Conversely, gene assemblies encoding 3000–7000 amino acid residues show very similar amino acid compositions and nucleotide compositions in intraspecies examinations. Consistent results were obtained from whole chromosomes consisting of putative small units of 3000–7000 amino acid residues. Additionally, it has been shown mathematically that 3000–7000 amino acid residues represent the amino acid composition of a certain amino acid pool. Thus, the genome structure, which is constructed homogeneously from putative similar small units, can be represented by a “pearl-necklace,” as shown in Figure 4. The fact that a genome is constructed homogeneously from putative similar small units indicates that micro-alterations of nucleotide sequences are canceled out within the small unit and that the small unit represents the characteristics of the whole genome. Macro-alterations, which are represented by the small unit and are based on species differences, occur synchronously over the genome. This conclusion has never been obtained from the analysis of nucleotide or amino acid sequences of actual genes. Based on these results, the ratios of amino acids to the total amino acids, or those of nucleotides to the total nucleotides, form useful indices for characterizing genomes whose nucleotide numbers differ among species.
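The statistical side of this argument, that a unit of several thousand residues already reproduces the composition of the pool it was drawn from, can be illustrated with the Python sketch below. The sequence is synthetic and drawn independently at random from invented weights, so it shows only the sampling effect and is not evidence about any real genome; the function names are assumptions made for this example.

```python
import random
from collections import Counter

def composition(residues, alphabet):
    """Return the ratio of each letter of `alphabet` to the total residues."""
    counts = Counter(residues)
    total = len(residues)
    return [counts[a] / total for a in alphabet]

def unit_deviations(residues, alphabet, unit_size):
    """Split `residues` into consecutive units and return, for each unit, the
    largest absolute difference between its composition and the whole."""
    whole = composition(residues, alphabet)
    deviations = []
    for start in range(0, len(residues) - unit_size + 1, unit_size):
        unit = composition(residues[start:start + unit_size], alphabet)
        deviations.append(max(abs(u - w) for u, w in zip(unit, whole)))
    return deviations

alphabet = "ACDEFGHIKLMNPQRSTVWY"
rng = random.Random(1)
weights = [rng.uniform(1, 10) for _ in alphabet]         # invented pool weights
genome = rng.choices(alphabet, weights=weights, k=100_000)  # synthetic residues

small = unit_deviations(genome, alphabet, 100)   # roughly one short gene
large = unit_deviations(genome, alphabet, 5000)  # within the 3000-7000 range

print("worst deviation, 100-residue units: ", round(max(small), 4))
print("worst deviation, 5000-residue units:", round(max(large), 4))
```

Short units fluctuate strongly, whereas units of about 5000 residues stay close to the composition of the whole sequence.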
7. Nucleotide compositions
As described above, the intraspecies rule of nucleotide composition was reported by Chargaff in 1950 as the first parity rule, and a similar parity rule for the single DNA strand was reported by the same group in 1968 as the second parity rule. Using values normalized to 1 (G + C + T + A = 1), the following relationships are obtained: G = C, T = A, and (G + A) = (C + T). Recently, Mitchell and Bridge reported, based on complete genome data from many species, that Chargaff’s second parity rule is applicable to a single DNA strand of a double-stranded DNA molecule. In addition, we showed that chloroplast DNA, plant mitochondrial DNA, and nuclear DNA obey Chargaff’s second parity rule as an inter-species rule, and that the second parity rule holds for nucleotide relationships not only in coding regions but also in non-coding regions, compared with those of the complete single DNA strand [37, 38]. When invertebrate mitochondrial DNA is classified into two groups, with high and low C/G ratios, nucleotide content relationships may be expressed by linear formulae. However, organellar DNA deviated from Chargaff’s second parity rule, and the nucleotide relationships were heteroskedastic [16, 39, 40]. The fact that all regression lines based on different kingdoms converged at a single point suggests that all species descended from a single origin. This is the first demonstration based on scientific evidence that all species descended from a single origin of life. This concept has been presumed since Darwin’s “On the Origin of Species” was published in 1859. In “On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life,” Charles Darwin discussed evolution over the course of generations via “Natural Selection”; however, he discussed neither “a single origin” nor “a common ancestor” of species.
The two regression lines of nucleotide relationships based on coding and non-coding regions converged to form a wedge shape, because both kinds of fragment exist on the same DNA strand. Similarly, the two regression lines based on chloroplast and plant mitochondrial DNA also converged to form a wedge shape. Thus, both types of organellar DNA descended independently from the same origin in biological evolution. Quite recently, it has been shown that vertebrates are descended from a certain invertebrate. However, although the phylogenetic trees [7, 8, 9, 10, 11] have an apparent single origin, these “facts” are merely the results of mathematical calculations.
8. Diagonal genome universe
Chargaff’s parity rules were originally based on intraspecies phenomena [12, 14], and, as mentioned above, they are applicable to inter-species evolutionary phenomena for nuclear, chloroplast, and plant mitochondrial DNA. The rules are represented by the following equations: G = C, T = A, and (G + A) = (C + T). As all values are normalized to 1 (G + C + A + T = 1), the rules can also be written as 2G + 2A = 1, which gives A = 0.5 – G, T = 0.5 – G, C = G, and, trivially, G = G. When the four contents are plotted against G content, the lines for G and C overlap, the lines for A and T overlap, and the former pair is the mirror image of the latter about the line y = 0.25, as shown in Figure 5. These equations mean that all four nucleotide contents can be expressed by just one nucleotide content using regression lines (Figure 5), and that the two pairs of overlapping contents (G or C, and T or A) are symmetrical. Thus, in nuclear, chloroplast, and mitochondrial DNA that obey Chargaff’s second parity rule, the four nucleotide contents (two overlapping points) move strictly along the diagonals of a square with sides of 0.5. Therefore, biological evolution caused by nucleotide alterations is expressed on the diagonal of a 0.5 square, the “diagonal genome universe,” although biological evolution shows a wide spectrum of phenotypic expressions over a 3.5-billion-year period.
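These relationships can be written as a small Python function that returns all four nucleotide contents from the G content alone. It is a sketch of the idealized case in which the parity rules hold exactly; the function name is an assumption made for this example, and real genomes follow fitted regression lines rather than these exact equations.

```python
def contents_from_g(g):
    """Return the four normalized nucleotide contents implied by G alone,
    assuming G = C, A = T, and G + C + A + T = 1 hold exactly."""
    if not 0.0 <= g <= 0.5:
        raise ValueError("G content must lie between 0 and 0.5")
    a = 0.5 - g
    return {"G": g, "C": g, "A": a, "T": a}

for g in (0.15, 0.25, 0.35):
    c = contents_from_g(g)
    midpoint = (c["G"] + c["A"]) / 2  # stays at 0.25: the axis of symmetry
    print(g, c, "midpoint of G and A:", round(midpoint, 3))
```

As G moves from 0 to 0.5, the point (G, C) moves along one diagonal of the 0.5 square and the point (G, A) moves along the other, with both always mirrored about y = 0.25.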
9. Codon evolution
The 20 amino acids are encoded by genes using nucleotide triplets; therefore, amino acid sequences are determined by triplet sequences. Additionally, amino acid sequences differ not only between genes but also within a species. These facts indicate that a comparison of codon evolution based on the complete genome, which comprises large numbers of different genes, would not be meaningful. Indeed, no clear evaluation has been obtained, despite the explanations attempted by many scientists [27, 28, 29]. However, as described in the previous section, it has been clarified that a whole genome is constructed from putative small units that encode proteins of similar amino acid composition. This suggests that the total codon usage deduced from the complete genome is stable and represents a characteristic of the whole genome. According to this concept, the correlations between nucleotide contents in a complete genome can be expressed by the linear formula y = ax + b, where “y” and “x” are nucleotide contents, and “a” and “b” are constants. In addition, as the usage of each codon is expressed by a linear formula among various organisms, determining any one nucleotide content in a given organism is essentially sufficient to estimate the other three nucleotide contents and, therefore, the usages of the 64 codons (Figure 6). The codon usage patterns and amino acid compositions estimated in this way are almost the same as the original experimental results. The codon usage patterns clearly indicate that codon usages changed synchronously among the 64 codons during biological evolution.
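The total codon usage referred to here, that is, the ratio of each of the 64 codons to all codons in the coding sequences of a genome, can be computed as in the Python sketch below. The function name and the two toy genes are hypothetical, and the sketch does not include the linear formulae used to estimate codon usage from a single nucleotide content, whose coefficients are not given in this text.

```python
from collections import Counter
from itertools import product

CODONS = ["".join(p) for p in product("TCAG", repeat=3)]  # all 64 triplets
CODON_SET = set(CODONS)

def codon_usage(coding_sequences):
    """Return the ratio of each of the 64 codons to the total number of
    codons, pooled over all coding sequences (read in frame from position 0)."""
    counts = Counter()
    for cds in coding_sequences:
        cds = cds.upper()
        usable = len(cds) - len(cds) % 3  # ignore an incomplete final triplet
        for start in range(0, usable, 3):
            codon = cds[start:start + 3]
            if codon in CODON_SET:
                counts[codon] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no valid codons found")
    return {codon: counts[codon] / total for codon in CODONS}

# Hypothetical toy genes, for illustration only.
toy_genes = ["ATGGCTAAATAA", "ATGGGTTAA"]
usage = codon_usage(toy_genes)

for codon in CODONS:
    if usage[codon] > 0:
        print(codon, round(usage[codon], 3))
```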
10. Natural selection in biological evolution based on amino acid contents
The above-mentioned theories have been described in previous review articles [36, 43]; therefore, this section uses recent data to present unique applications of amino acid compositions or nucleotide contents to the construction of phylogenetic trees for the study of evolution.
The theory of natural selection was promoted by Charles Darwin and Alfred Wallace 150 years ago. This theory was derived from specific differences or similarities in the phenotypes of organisms that lived on geologically isolated islands. The theory of biological evolution has been further developed by paleontology, using phenotypic changes in fossils, and by molecular biology, using genotypic modifications (nucleotides or amino acids) of genes in living organisms.
Generally, the nucleotide or amino acid sequences of a particular gene or genes have been the focus of biological evolution studies, and many phylogenetic trees have been constructed using nucleotide or amino acid sequences [7, 8, 9, 10, 11, 27, 29, 45]. Conversely, amino acid compositions and nucleotide contents have rarely been used for whole genome research. However, these indices have been used to classify bacteria, archaea, and eukaryotes and, recently, to study vertebrate evolution. In those studies, all organisms could be classified into two types, “GC-rich” and “AT-rich,” and the vertebrates examined were further classified into two groups, terrestrial and aquatic vertebrates, based on natural selection. A similar result was obtained from an analysis based on 16S rRNA sequences [45, 47].
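The general idea of a cluster analysis that uses composition ratios as traits can be sketched in pure Python as follows. Everything specific here is an assumption made for illustration: the organism labels and four-trait vectors are invented, and Euclidean distance with average linkage is used only for brevity, since the clustering method of the cited studies is not stated in this text.

```python
import math
from itertools import combinations

def distance(u, v):
    """Euclidean distance between two composition vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def average_linkage(c1, c2, vectors):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(distance(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def cluster(vectors, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters
    until only `n_clusters` remain."""
    clusters = [[name] for name in vectors]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(
                clusters[pair[0]], clusters[pair[1]], vectors),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return [sorted(c) for c in clusters]

# Invented composition vectors (each sums to 1); real traits would be the
# 20 amino acid ratios or the 4 nucleotide ratios of a complete genome.
vectors = {
    "A1": [0.30, 0.30, 0.20, 0.20],
    "A2": [0.31, 0.29, 0.21, 0.19],
    "B1": [0.20, 0.20, 0.30, 0.30],
    "B2": [0.19, 0.21, 0.29, 0.31],
}
groups = cluster(vectors, 2)
print(groups)
```

Organisms with similar compositions end up in the same cluster; recording the order of the merges would give the tree that is usually drawn as a dendrogram.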
When the normalized amino acid compositions of vertebrate and invertebrate complete mitochondrial genomes were used, the organisms were separated cleanly into two large clusters, vertebrates and invertebrates (Figure 7). In invertebrates, starfish (Echinodermata) formed a small cluster, and squid and octopus (Mollusca) were grouped into the same cluster. Vertebrates were further classified into three major clusters: mammals, fish, and a mixture of reptiles and amphibians. For example, primates (human, chimpanzee, and gorilla) formed a small cluster. Thus, closely related species fell into the same cluster and did not split into different clusters. These results indicate that the normalized values of amino acid and nucleotide contents calculated from complete genomes could be used to characterize organisms and to construct phylogenetic trees. Our results based on complete mitochondrial genomes revealed that hemichordates (
In a previous study to classify vertebrates [49, 50], organisms were chosen at random without any presupposition, so it was difficult to evaluate whether the classification results in the phylogenetic trees were reasonable. Using the amino acid composition as the trait, the vertebrates examined were separated into two major clusters (Figure 8), terrestrial and aquatic vertebrates. The exceptions were the hagfish (
Single genes have been used to construct phylogenetic trees [7, 8, 9, 10, 11], and 16S rRNA has been frequently examined [27, 29]. The phylogenetic tree based on 16S rRNA sequences of various vertebrates is shown in Figure 9. The tree is consistent with that based on nucleotide contents. The hagfish (
11. Organelle evolution
In Chargaff’s first parity rule, G = C and A = T in double-stranded DNA, while in the second parity rule, G ≈ C and A ≈ T in a complete single DNA strand. Based on Chargaff’s second parity rule, nucleotide content differences such as (G – C) and (A – T) reflect biological evolution. In addition, the other nucleotide content differences, (G – A), (G – T), (C – A), and (C – T), also reflect biological evolution [34, 53].
Six nucleotide content differences among the complete mitochondria of the four species (
To allow simple comparison of inter- and intraspecies genome structures, genomes were divided into three fragments throughout subsequent analyses, from which three separate patterns emerged. There is no inversion of nucleotide content differences that was observed in the mtDNA of
In the mtDNA of primate species
12. Definitive universal equations
With nucleotide contents normalized (G + C + A + T = 1), and with G = C and A = T from Chargaff’s parity rules, 2G + 2A = 1 is obtained. This equation can be rearranged to A = 0.5 – G, then to A – G = 0.5 – 2G, and finally to G – A = 2G – 0.5. The relationship between (G – A) and G is therefore linear when both G and A are expressed by linear functions. In animal mitochondria, only the correlations between the two purines (A versus G) or the two pyrimidines (C versus T) are linear, while the correlations between purines and pyrimidines (A or G versus T or C) are weak or absent. For example, when (G – C), (G – T), (G – A), (C – T), (C – A), and (T – A) were plotted against G content, only (G – A) versus G content was linear in vertebrate mitochondria. In invertebrate mitochondria, the nucleotide content differences were only weakly linear when plotted against G content.
When (X – Y)/(X + Y) was plotted against (X – Y), the following linear relationship was obtained in mitochondria, chloroplasts, and chromosomes (Figure 12): (X – Y)/(X + Y) = a (X – Y) + b, where X and Y are nucleotide contents, and (a) and (b) are constants. As (b) was almost null and (a) was ~2.0, (X – Y)/(X + Y) ≈ 2.0 (X – Y). In these genome analyses, which are independent of Chargaff’s parity rules, the values of (a) for (G, C), (G, A), (G, T), (C, T), (C, A), and (A, T) were 2.5858, 1.85558, 1.9908, 1.9771, 1.9968, and 1.5689, respectively, in our previous results [53, 54]. Because (b) is almost null, (X + Y) ≈ 1/a; based on these results, (G + C), (G + A), (G + T), (C + T), (C + A), and (A + T) were 0.39, 0.54, 0.50, 0.51, 0.50, and 0.64, respectively. In virus genome analyses [53, 54], the values of (a) were 1.9–2.1, and the values of (X + Y) were 0.47–0.53. In contrast, with nucleotide contents normalized (G + C + A + T = 1), and with G = C and A = T from Chargaff’s parity rules, 2G + 2A = 1 is obtained, which gives G + A = 0.5. This value is consistent with the value obtained above from genome analyses. Similarly, G + T = 0.5, C + A = 0.5, and C + T = 0.5, although (G + C) and (A + T) cannot be determined. Therefore, the four nucleotide contents are expressed by the following regression lines, plotted against G content: A = 0.5 – G, T = 0.5 – G, C = G, and G = G. Lines G and C overlap, as do lines A and T, and the former pair is the mirror image of the latter about the line y = 0.25. The intercepts of lines G and C are close to the origin, while those of lines A and T are close to 0.5 on the vertical and horizontal axes. All organisms from bacteria to
A linear regression line was not obtained when randomly chosen values were used (Figure 12A). Furthermore, when (X – Y)/(X + Y) was plotted against (X/Y), the following logarithmic function was obtained for all tested genomes, as well as for randomly chosen values (Figure 12B): (X – Y)/(X + Y) = a ln (X/Y) + b. As (b) was almost null and (a) was ~0.5, (X – Y)/(X + Y) ≈ 0.5 ln (X/Y). Thus, the ratio between two values, (X/Y), is related to their normalized difference by a logarithmic function. When the GC skew was plotted against G content, animal mitochondria were classified into two groups: high and low C/G. This fact indicates that the ratio C/G and the GC skew are evolutionarily related to each other. Any change can be expressed universally by a definitive logarithmic function, (X – Y)/(X + Y) = a ln (X/Y) + b. The present results indicate that cellular organelle evolution is strictly controlled under these characteristic rules, although non-animal mitochondria, chloroplasts, and chromosomes are controlled under Chargaff’s parity rules [12, 14]. The present study clearly shows that biological evolution, which seems to be based on complicated processes, is governed by simple universal equations.
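Both relationships can be checked numerically. When X + Y is fixed at a value s, (X – Y)/(X + Y) equals (1/s)(X – Y) exactly, which gives a slope of 2 for s = 0.5. Separately, for any positive X and Y, (X – Y)/(X + Y) = tanh(0.5 ln (X/Y)), which is close to 0.5 ln (X/Y) as long as X/Y stays near 1. The Python sketch below verifies both statements with illustrative values chosen here, not with genome data; the function names are assumptions made for this example.

```python
import math

def skew(x, y):
    """Normalized difference (X - Y)/(X + Y)."""
    return (x - y) / (x + y)

def log_approximation(x, y):
    """The logarithmic expression 0.5 * ln(X/Y)."""
    return 0.5 * math.log(x / y)

# Illustrative pairs with X + Y = 0.5, as for (G, A) under the parity rules.
pairs = [(0.20, 0.30), (0.25, 0.25), (0.28, 0.22)]

for x, y in pairs:
    exact = skew(x, y)
    linear = 2.0 * (x - y)                          # slope 1/(X + Y) = 2
    identity = math.tanh(log_approximation(x, y))   # exact for any X, Y > 0
    approx = log_approximation(x, y)                # good while X/Y is near 1
    print(f"X={x:.2f} Y={y:.2f} skew={exact:+.4f} linear={linear:+.4f} "
          f"tanh={identity:+.4f} 0.5ln={approx:+.4f}")
```

The logarithmic relation holds for any pair of positive values, whereas the linear relation requires X + Y to be nearly constant. This is consistent with the observation that randomly chosen values follow the logarithmic function but not the linear one.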
The ratios of amino acids to the total amino acids, or of nucleotides to the total nucleotides, predicted from complete genomes consisting of huge numbers of nucleotides can characterize a whole organism. In addition, as these values are independent of species and genome size, these indexes are very useful for genome research as well as for single gene research. The validity of these indexes rests on the homogeneity of genomic structures. Furthermore, patternalization of values after simple calculations based on large data sets can provide an intuitive picture and useful insights, revealing the homogeneity of genomic structures and the synchronous alterations over the genome that follow from it. Finally, any change between two values, X and Y, including biological evolution, can be expressed definitively by a linear regression equation, (X – Y)/(X + Y) = a (X – Y) + b, where X and Y are nucleotide contents, and (a) and (b) are constants, and by a logarithmic function, (X – Y)/(X + Y) = a′ ln (X/Y) + b′, where (a′) and (b′) are constants. As the present review is based on the endeavors and data of numerous scientists from all over the world, the author would finally like to express the following feeling as one of those scientists. Human beings are one species among the huge number of organisms on the Earth, and long evolution has not ranked us as a special species above all others. However, we have built modern civilization on the use of fossil energy, which seems to induce climate change. Thus, we are responsible for establishing sustainable development, not only for human beings but also for other organisms. The Earth is for all organisms, not only for human beings.
The author gratefully acknowledges President Hiroyuki Okada of Shinko Sangyo, Co. Ltd., Takasaki, Gunma, Japan, for his financial support, and Dr. Teiji Okayasu, a collaborator at Dokkyo Medical University, for his excellent computer analyses.