Table 1. Combinations of three bases in the coding strand of DNA and the amino acids they code for (three-letter abbreviations).
Physical theories often start out as theories that embrace only the essential features of the macroscopic world, and their predictions depend on parameters that must be assumed or taken from experiment; such parameters cannot be predicted by the theories themselves. To understand why the parameters have the values they do, we must go one level deeper, typically to smaller scales, where the most accessible processes are those at the lowest level. When the deeper level reduces the number of unknown parameters, we consider the theory complete and satisfactory. The level below conventional molecular biology is spanned by atomic and molecular structure and by quantum dynamics. At this lowest level, however, it becomes very difficult to grasp all the features of the molecular processes that occur in living systems; the complexity of the numerous parameters involved makes the endeavour a very intricate one. Information theory provides a powerful framework for extracting the essential features of complicated life processes and analyzing them in a systematic manner. In this connection, quantum information biology is a new field of scientific inquiry in which information-theoretical tools and concepts make it possible to gain insight into some of the most basic, and yet unsolved, questions of molecular biology.
Chirality is often glossed over in theoretical or experimental discussions concerning the origin of life, but the ubiquity of homochiral building blocks in known biological systems demands explanation. Information theory can provide a quantitative framework for understanding the role of chirality in biology. It has so far been argued that the genetic code is “unknowable” when DNA is considered only as a string of letters (... ATTGCAAGC...) and, likewise, proteins as strings of identifiers (... DYRFQ...). We believe this conclusion is likely wrong, because it entirely fails to consider the information content of the molecular structures themselves and of their conformations.
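To make the string-level picture concrete, a zeroth-order estimate of the information content of a sequence treats it as a mere bag of symbols. The following minimal sketch (the function name and the toy sequence are our own illustration, not taken from the chapter) computes the per-symbol Shannon entropy of a DNA string:

```python
from collections import Counter
from math import log2

def sequence_entropy(seq):
    """Zeroth-order Shannon entropy of a symbol string, in bits per symbol."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * log2(c / n) for c in counts.values())

# A 4-letter DNA alphabet can carry at most 2 bits per symbol;
# everything about 3-D structure and conformation is invisible here.
print(sequence_entropy("ATTGCAAGC"))
```

The value is bounded by 2 bits for any DNA string, which is precisely the point: the letter-string description saturates quickly and carries none of the structural and conformational information discussed in this Chapter.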
According to molecular biology, living systems consist of building blocks encoded in nucleic acids (DNA and RNA) and proteins, which possess complex patterns that control all biological functions. Although natural processes select building blocks that combine chemical simplicity (easy availability and quick synthesis) with functional ability (implementing the desired tasks), the most intriguing question concerns the selectivity of each amino acid towards a specific codon/anticodon. The universal triplet genetic code has considerable and non-uniform degeneracy, with 64 codons carrying 21 signals (the 20 amino acids plus Stop), as shown in Table 1. Although there is a rough rule of similar codons for similar amino acids, no clear pattern is obvious.
Information theory of quantum many-body systems lies at the frontier of the physical sciences, where major areas of research are interconnected: physics, mathematics, chemistry, and biology. There is therefore an inherent interest in applying information-theoretic ideas and methodologies to chemical, mesoscopic, and biological systems, and to the processes they sustain. In recent years there has also been increasing interest in applying complexity concepts to physical, chemical, and biological phenomena. Complexity measures are understood as general indicators of pattern, structure, and correlation in systems or processes. Several alternative mathematical notions have been proposed for quantifying the concepts of complexity and information, including the Kolmogorov–Chaitin or algorithmic information theory (Kolmogorov, 1965; Chaitin, 1966), the classical information theory of Shannon and Weaver (Shannon & Weaver, 1948), Fisher information (Fisher, 1925; Frieden, 2004), and the logical (Bennett, 1988) and thermodynamical (Lloyd & Pagels, 1988) depths, among others. Some of these share rigorous connections with one another and with Bayesian and information theory (Vitanyi & Li, 2000). The term complexity has been applied with different meanings (algorithmic, geometrical, computational, stochastic, effective, statistical, structural, among others) and has been employed in many fields: dynamical systems, disordered systems, spatial patterns, language, multielectronic systems, cellular automata, neuronal networks, self-organization, DNA analyses, and the social sciences (Shalizi et al., 2004; Rosso et al., 2003; Chatzisavvas et al., 2005; Borgoo et al., 2007).
The definition of complexity is not unique; its quantitative characterization has been an important subject of research and has received considerable attention (Feldman & Crutchfield, 1998; Lamberti et al., 2004). The usefulness of each definition depends on the type of system or process under study, the level of description, and the scale of the interactions, whether among elementary particles, atoms, molecules, or biological systems. Fundamental concepts such as uncertainty or randomness are frequently employed in definitions of complexity, although other concepts, such as clustering, order, localization, or organization, may also be important for characterizing the complexity of systems or processes. It is not clear how these concepts should enter the definitions so as to quantitatively assess complexity. Recent proposals, however, have formulated this quantity as a product of two factors, accounting for order/disequilibrium and delocalization/uncertainty. This is the case for the López-Ruiz–Mancini–Calbet (LMC) shape complexity (López-Ruiz et al., 1995) which, like others, satisfies the boundary conditions by reaching its minimal value in both the extreme ordered and the extreme disordered limits. The LMC complexity measure has been criticized (Anteneodo & Plastino, 1996), modified (Catalán et al., 2002; Martin et al., 2003) and generalized (López-Ruiz, 2005), leading to a useful estimator that satisfies several desirable properties of invariance under scaling transformations, translation, and replication (Yamano, 2004; Yamano, 1995). The utility of this improved complexity has been verified in many fields, and it allows reliable detection of periodic, quasiperiodic, linear stochastic, and chaotic dynamics (Yamano, 2004; López-Ruiz et al., 1995; Yamano, 1995).
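A discrete sketch of the LMC construction makes these boundary conditions explicit (this is our own minimal example on probability vectors; the chapter itself works with continuous electron densities): the measure vanishes both for the fully ordered (delta-like) distribution and for the fully disordered (uniform) one.

```python
from math import log

def lmc_complexity(p):
    """Discrete LMC shape complexity C = H * D for probabilities p summing to 1."""
    n = len(p)
    H = -sum(pi * log(pi) for pi in p if pi > 0)   # Shannon entropy (disorder)
    D = sum((pi - 1.0 / n) ** 2 for pi in p)       # disequilibrium (departure from uniformity)
    return H * D

uniform = [0.25] * 4       # maximally disordered: D = 0, so C = 0
delta   = [1.0, 0, 0, 0]   # maximally ordered:    H = 0, so C = 0
mixed   = [0.7, 0.1, 0.1, 0.1]
print(lmc_complexity(uniform), lmc_complexity(delta), lmc_complexity(mixed))
```

Only the intermediate distribution yields a non-zero complexity, which is exactly the behavior required of the LMC measure at the ordered and disordered limits.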
The LMC measure is constructed as the product of two important information-theoretic quantities (see below): the so-called disequilibrium D (also known as self-similarity (Carbó-Dorca et al., 1980) or information energy (Onicescu, 1966)), which quantifies the departure of the probability density from uniformity (equiprobability) (Catalán et al., 2002; Martin et al., 2003), and the Shannon entropy S, a general measure of the randomness/uncertainty of the probability density (Shannon & Weaver, 1948), which quantifies its departure from localization. Both global quantities are closely related to the spread of the probability distribution.
Two further ingredients of the analysis are the Fisher information, a local measure of the gradient content (narrowness) of the density, and the Fisher-Shannon product, which combines it with the Shannon entropy; both are defined in Section 2.
In line with these developments we have undertaken multidisciplinary research projects that employ IT at different levels, classical (Shannon, Fisher, complexity, etc.) and quantum (von Neumann and other entanglement measures), on a variety of chemical processes and on organic and nanostructured molecules. Recently, significant advances in chemistry have been achieved through Shannon entropies: the localized/delocalized features of the electron distributions allow a phenomenological description of the course of elementary chemical reactions, revealing chemically important regions that are not visible in the energy profile, such as those in which bond forming and bond breaking occur (Esquivel et al., 2009). Further, the synchronous reaction mechanism of an SN2-type chemical reaction and the non-synchronous mechanistic behavior of the simplest hydrogenic abstraction reaction were predicted by means of Shannon entropy analysis (Esquivel et al., 2010a). In addition, a recent study of the three-center insertion reaction of silylene has shown that the information-theoretical measures support the Zewail-Polanyi concept of a continuum of transient states for the transition state, rather than a single state, in agreement with other analyses (Esquivel et al., 2010b). While the Shannon entropy has remained the major tool of IT, there have been numerous applications of Fisher information through the narrowness/disorder features of electron densities in conjugate spaces. Thus, in chemical reactions the Fisher measure has been employed to analyze local features (Esquivel et al., 2010c) and to study the steric effect in the conformational barrier of ethane (Esquivel et al., 2011a). The complexity of physical, chemical, and biological systems is a topic of great contemporary interest.
The quantification of the complexity of real systems is a formidable task, although various single and composite information-theoretic measures have been proposed: for instance, the Shannon entropy, the disequilibrium, and the Fisher information, as well as the composite LMC and Fisher-Shannon measures built from them.
In this Chapter we present arguments based on the information content of L- and D-amino acids to explain the biological preference toward homochirality. In addition, we present benchmark results for the information content of codons and amino acids based on information-theoretical measures and statistical complexity factors, which allow us to elucidate the coding links between these building blocks and their selectivity.
2. Information-theoretical measures and complexities
In the independent-particle approximation, the total density distribution in a molecule is a sum of contributions from the electrons in each of the occupied orbitals. This is the case in both position and momentum space.
Standard procedures for the Fourier transformation of position space orbitals generated by ab-initio methods have been described (Rawlings & Davidson, 1985). The orbitals employed in ab-initio methods are linear combinations of atomic basis functions and since analytic expressions are known for the Fourier transforms of such basis functions (Kaijser & Smith, 1997), the transformation of the total molecular electronic wavefunction from position to momentum space is computationally straightforward (Kohout, 2007).
As we mentioned in the introduction, the Shannon entropy of the position-space electron density ρ(r) is

S_r = −∫ ρ(r) ln ρ(r) d³r,

from which the exponential entropy L_r = exp(S_r) is defined. Similar expressions for the momentum-space density γ(p) define S_p and L_p.

It is important to mention that the disequilibrium is given by D_r = ∫ ρ²(r) d³r, and the Fisher information by I_r = ∫ |∇ρ(r)|²/ρ(r) d³r, with their momentum-space analogues D_p and I_p. The LMC shape complexity can then be written as C_LMC = D_r · exp(S_r), which depends on the Shannon entropy defined above; the analogous expression C_LMC = D_p · exp(S_p) holds in momentum space.

Let us remark that the factors in the power Shannon entropy J_r = (1/(2πe)) exp(2S_r/3) are chosen so that the Fisher-Shannon product P_r = J_r · I_r satisfies P_r ≥ 3, with equality for Gaussian densities; together with the entropic uncertainty relation S_r + S_p ≥ 3(1 + ln π) (Bialynicki-Birula & Mycielski, 1975), these are the basic inequalities used below.

It is worthwhile noting that the aforementioned inequalities remain valid for distributions normalized to unity, which is the choice employed throughout this work for the 3-dimensional molecular case.

Aside from the analysis of the position- and momentum-space information measures separately, we have considered it useful to study these magnitudes in product form, e.g., the joint quantities S_r + S_p, D_r·D_p and I_r·I_p. From the above equations, it is clear that the features and patterns of both the position- and momentum-space densities enter these composite measures on an equal footing.
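As a one-dimensional numerical sanity check of the entropic uncertainty relation (a sketch under our own conventions, with hbar = 1 and the 1-D bound S_x + S_p ≥ 1 + ln π; the chapter's molecular calculations are three-dimensional), a Gaussian position density of standard deviation sigma has a conjugate momentum density of width 1/(2·sigma), and the entropy sum saturates the bound:

```python
import numpy as np

def shannon_entropy(pdf, dx):
    """S = -∫ ρ ln ρ dx on a uniform grid (0·ln 0 treated as 0)."""
    safe = np.where(pdf > 0, pdf, 1.0)
    return -np.sum(pdf * np.log(safe)) * dx

sigma = 1.3                        # width of the position-space Gaussian
x = np.linspace(-30.0, 30.0, 200001)
dx = x[1] - x[0]

rho = np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))  # position density
sp = 1.0 / (2.0 * sigma)           # conjugate Gaussian width in momentum space (hbar = 1)
gamma = np.exp(-x**2 / (2 * sp**2)) / (sp * np.sqrt(2 * np.pi))      # momentum density

S_sum = shannon_entropy(rho, dx) + shannon_entropy(gamma, dx)
print(S_sum, 1 + np.log(np.pi))    # the Gaussian saturates the 1-D bound
```

Any non-Gaussian pair of conjugate densities would give a strictly larger entropy sum, which is what makes the relation useful as a joint delocalization measure.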
We have also evaluated some reactivity parameters that may be useful for analyzing the chemical reactivity of the amino acids. Thus, we have computed several reactivity properties: the ionization potential (IP), the hardness (η) and the electrophilicity index (ω). These properties were obtained at the Hartree-Fock (HF) level of theory in order to invoke Koopmans' theorem (Koopmans, 1933; Janak, 1978), which relates the first vertical ionization energy and the electron affinity to the HOMO and LUMO energies needed to calculate the conceptual-DFT properties. Parr and Pearson proposed a quantitative definition of hardness (η) within conceptual DFT (Parr & Yang, 1989):
η = (ε_LUMO − ε_HOMO)/2, where ε denotes the frontier molecular orbital energies. This frontier-orbital approximation follows from Koopmans' theorem, although the theorem has recently been called into question (Ayers et al., 2006; Pearson, 1995). In general terms, the chemical hardness and softness are good descriptors of chemical reactivity. The former has been employed (Ayers et al., 2006; Pearson, 1995; Geerlings et al., 2003) as a measure of the reactivity of a molecule in the sense of its resistance to changes in the electron distribution, i.e., molecules with larger values of η are expected to be less reactive.
The electrophilicity index (Parr et al., 1999) is defined as ω = μ²/(2η), where μ ≈ (ε_HOMO + ε_LUMO)/2 is the chemical potential. The electrophilicity is also a good descriptor of chemical reactivity: it quantifies the global electrophilic power of a molecule, its predisposition to acquire an additional electronic charge (Parr & Yang, 1989).
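These frontier-orbital formulas are straightforward to tabulate. A minimal sketch (the orbital energies below are made-up illustrative numbers in hartrees, not results from the chapter), using the Koopmans relations IP ≈ −ε_HOMO and EA ≈ −ε_LUMO together with μ = −(IP + EA)/2, η = (IP − EA)/2 and ω = μ²/(2η):

```python
def reactivity_descriptors(e_homo, e_lumo):
    """Conceptual-DFT reactivity descriptors from frontier orbital energies."""
    ip = -e_homo                  # Koopmans: vertical ionization potential
    ea = -e_lumo                  # Koopmans: electron affinity
    mu = -(ip + ea) / 2           # chemical potential
    eta = (ip - ea) / 2           # hardness (Parr & Pearson)
    omega = mu**2 / (2 * eta)     # electrophilicity index (Parr et al., 1999)
    return ip, eta, omega

# Hypothetical HF frontier energies (hartree), for illustration only.
ip, eta, omega = reactivity_descriptors(e_homo=-0.35, e_lumo=0.05)
print(ip, eta, omega)
```

A larger η (wider HOMO-LUMO gap) signals a harder, less reactive molecule, while a larger ω signals a stronger global electrophile, matching the qualitative use of these descriptors in the text.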
The exact origin of homochirality is one of the great unanswered questions in evolutionary science; indeed, molecular homochirality has remained a mystery since Pasteur. Biological systems are composed almost entirely of homochiral molecules; the best-known examples are that natural proteins are composed of L-amino acids, whereas nucleic acids (RNA or DNA) are composed of D-sugars (Root-Bernstein, 2007; Werner, 2009; Viedma et al., 2008). The reason for this behavior continues to be a mystery; to date, no satisfactory explanation has been provided for the origin of the homochirality of biological systems. The homochirality of the amino acids is critical to their function in proteins: if proteins (built from L-amino acids) were not homochiral, with a few D-enantiomers at random positions, they would not exhibit biological functionality. It is interesting to note that L-amino acids can be synthesized by specific enzymes; in prebiotic life, however, these processes remain unknown. The same problem exists for sugars, which have the D configuration (Hein and Blackmond, 2011; Zehnacker et al., 2008; Nanda and DeGrado, 2004).
The natural amino acids contain one or more asymmetric carbon atoms, except glycine. The two forms of each molecule are nonsuperposable mirror images of each other, representing right-handed (D enantiomer) and left-handed (L enantiomer) structures. It is considered that equal amounts of D- and L-amino acids existed on the primal earth before the emergence of life. Although the chemical and physical properties of L- and D-amino acids are extremely similar, except for their optical character, the reason for the exclusion of D-amino acids, and why all living organisms are now composed predominantly of L-amino acids, is not well understood; what is clear is that homochirality is essential for the development and maintenance of life (Breslow, 2011; Fujii et al., 2010; Tamura, 2008). The essential property of α-amino acids is to form linear polymers capable of folding into 3-dimensional structures, which form the catalytically active sites that are essential for life. In this process, amino acids behave as heterobifunctional molecules, forming polymers via head-to-tail linkage. In contrast, industrial nylons are often prepared from pairs of homobifunctional molecules (such as diamines and dicarboxylic acids); the use of a single molecule containing both linkable functionalities is somewhat simpler (Cleaves, 2010; Weber and Miller, 1981; Hicks, 2002).
The concept of chirality in chemistry is of paramount interest because living systems are formed of chiral molecules: proteins, DNA, amino acids, sugars and many natural products such as steroids, hormones, and pheromones possess chirality. Indeed, amino acids are found to be largely homochiral (Stryer, 1995) in the L form. Moreover, because most biological receptors and membranes are chiral, many drugs, herbicides, pesticides and other biological agents must themselves possess chirality. Synthetic processes ordinarily produce a 50:50 (racemic) mixture of left-handed and right-handed molecules (so-called enantiomers), and the two enantiomers often behave differently in a biological system.
A major topic of research has been the origin of homochirality, and in this respect biomembranes may have played an important role in the homochirality of biopolymers. One of the most intriguing problems in the life sciences is the mechanism of symmetry breaking. Many theories have been proposed on these topics, attempting to explain the amplification of an initial enantiomeric imbalance toward the enantiopurity of biomolecules (Bombelli et al., 2004). In theories of symmetry breaking and of enantiomeric excess amplification, little attention has been paid to the possible role of biomembranes, or of simple self-aggregated systems that may have acted as primitive biomembranes. Nevertheless, it is possible that amphiphilic boundary systems, which many scientists consider intimately connected to the emergence and development of life (Avalos et al., 2000; Bachmann et al., 1992), played a role in the history of homochirality by virtue of recognition and compartmentalization phenomena (Menger and Angelova, 1998). In general, the major reason for the different recognition of two enantiomers by biological cells is the homochirality of biomolecules such as L-amino acids and D-sugars. The diastereomeric interaction between the enantiomers of a bioactive compound and a receptor formed from a chiral protein can cause different physiological responses. The production of enantiomerically enriched bioactive compounds is one of the most important topics in chemistry, and there is great interest in how and when biomolecules achieved high enantioenrichment, including the origin of chirality from the standpoint of chiral chemistry (Zehnacker et al., 2008; Breslow, 2011; Fujii et al., 2010; Tamura, 2008; Arnett and Thompson, 1981).
3.1. Physical and information-theoretical properties
Figure 1 illustrates a Venn diagram (Livingstone & Barton, 1993; Betts & Russell, 2003) contained within a boundary that symbolizes the universal set of the 20 common amino acids (in one-letter code). The amino acids that possess the dominant properties, hydrophobic, polar and small (< 60 Å³), are defined by their set boundaries. Subsets contain amino acids with the properties aliphatic (branched-sidechain non-polar), aromatic, charged, positive, negative and tiny (< 35 Å³). Shaded areas define sets of properties possessed by none of the common amino acids. For instance, cysteine occurs at two different positions in the Venn diagram. When participating in a disulphide bridge (CS-S), cysteine exhibits the properties 'hydrophobic' and 'small'. In addition to these properties, the reduced form (CS-H) shows polar character and fits the criteria for membership of the 'tiny' set. Hence, the Venn diagram (Figure 1) assigns multiple properties to each amino acid; thus lysine has the property hydrophobic by virtue of its long sidechain as well as the properties polar, positive and charged. Alternative property tables may also be defined; for example, the amino acids might simply be grouped into non-intersecting sets labelled hydrophobic, charged and neutral.
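Such overlapping property sets are naturally expressed as set intersections. The sketch below is our own illustration: the memberships follow the text only loosely (lysine is counted hydrophobic via its long sidechain, as stated above) and should not be read as the published Livingstone & Barton classification.

```python
# Illustrative property sets in one-letter code (an assumed sketch, not the
# published classification).
hydrophobic = {"A", "V", "L", "I", "M", "F", "W", "Y", "C", "K", "T", "H", "G"}
polar = {"S", "T", "N", "Q", "Y", "H", "K", "R", "D", "E", "C", "W"}
charged = {"K", "R", "H", "D", "E"}
positive = {"K", "R", "H"}

# A Venn-style query: which residues sit in the hydrophobic-polar overlap?
print(sorted(hydrophobic & polar))

# Lysine carries several labels at once, as in the text.
labels = {"hydrophobic": hydrophobic, "polar": polar,
          "charged": charged, "positive": positive}
print(sorted(name for name, members in labels.items() if "K" in members))
```

The same machinery extends to the non-intersecting alternative mentioned above: one would simply require the sets to be pairwise disjoint.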
To perform an information-theoretical analysis of L- and D-amino acids we employed the corresponding L-enantiomers reported in the Protein Data Bank (PDB), which provides a standard representation for macromolecular structure data derived from X-ray diffraction and NMR studies. In a second stage, the D-enantiomers were obtained from the L-amino acids by interchanging the corresponding functional groups (carboxyl and amino) of the α-carbon so as to represent the D-configuration of the chiral center, with steric impediments taken into account. The latter is achieved by employing the Ramachandran map (Ramachandran et al., 1963), which represents the phi-psi torsion angles for all residues in the amino acid structure, so as to avoid steric hindrance. Hence, the backbones of all the studied amino acids represent possible biological structures within the allowed regions of the Ramachandran map. In the third stage, an electronic-structure optimization of the geometry was performed on all the enantiomers of the twenty essential amino acids so as to obtain minimum-energy structures that preserve the backbone (see above). In the last stage, all of the information-theoretic measures were calculated with a suite of programs that has been discussed elsewhere (Esquivel et al., 2012).
In Figures 2 through 4 we depict selected information-theoretical measures and complexities in position space versus the number of electrons and the total energy. For instance, it may be observed from Fig. 2 that the Shannon entropy increases with the number of electrons, and interesting properties emerge: the aromatic amino acids possess more delocalized densities than the rest (see Figure 1B), which confers specific chemical properties on them. On the other hand, the disequilibrium diminishes as the number of electrons increases (see Fig. 2), which can be related to the chemical stability of the amino acids; e.g., cysteine and methionine show the largest values (see Fig. 2), in agreement with the biological evidence that both molecules play multiple roles in proteins, chemical as well as structural, which accounts for the high reactivity attributed to both. In contrast, the aromatic amino acids (see Fig. 1B) are the least reactive, in agreement with their lower disequilibrium values observed in Fig. 2.
In Figure 3 we have plotted the
In Figures 5 through 8 we analyze the homochiral behavior of all the amino acids by plotting the difference between the L and D values of several physical properties (energy, ionization potential, hardness, electrophilicity) and of some relevant information-theoretical measures (Shannon entropy, Fisher information, LMC- and FS-complexity). From Figures 5 and 6 one can readily observe that none of the physical properties studied in this work shows a uniform enantiomeric behavior, i.e., it is not possible to distinguish the L-amino acids from the D-ones by means of a specific physical property. In contrast, the L-amino acids can be uniquely characterized against the D-ones when information-theoretical measures are employed (see Figures 7 and 8), and this is perhaps the most interesting result of our work. To the best of our knowledge no similar observations have been reported elsewhere, providing strong evidence of the utility of Information Theory tools for decoding the essential building blocks of life.
4. Genetic code
The genetic code refers to a nearly universal assignment of codons of nucleotides to amino acids. The codon-to-amino-acid assignment is realized through: (i) the code adaptor molecules, transfer RNAs (tRNAs), each with a codon's complementary replica (anticodon) and the corresponding amino acid attached to the 3' end, and (ii) aminoacyl-tRNA synthetases (aaRSs), the enzymes that actually recognize and connect the proper amino acids and tRNAs. The origin of the genetic code is an inherently difficult problem (Crick, 1976), given that the events determining the genetic code took place long ago and that the present genetic code is relatively compact. The degeneracy of the genetic code implies that one or more similar tRNAs can recognize the same codon on a messenger RNA (mRNA). The number of amino acids and codons is fixed at 20 amino acids and 64 codons (three positions per codon, with 4 nucleotides, A, C, U, G), but the number of tRNA genes varies widely, from 29 to 126, even between closely related organisms. The frequency of synonymous codon use differs between organisms, within genomes, and along genes, a phenomenon known as codon usage bias (CUB) (Thiele et al., 2011).
Sequences of bases in the coding strand of DNA, or in messenger RNA, carry coded instructions for building protein chains out of amino acids. There are 20 amino acids used in making proteins, but only four different bases with which to code for them. Obviously a single base cannot code for each amino acid: that would leave 16 amino acids with no codes. Taking two bases per amino acid would still give only 16 possible codes (TT, TC, TA, TG, CT, CC, CA and so on), which is still not enough. Taking three bases per amino acid, however, gives 64 codes (TTT, TTC, TTA, TTG, TCT, TCC and so on), enough to code for everything with plenty to spare. A full table appears below. A three-base sequence in DNA or RNA is known as a codon.
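The counting argument above can be checked by brute-force enumeration (a trivial sketch):

```python
from itertools import product

bases = "TCAG"
for k in (1, 2, 3):
    words = ["".join(w) for w in product(bases, repeat=k)]
    # 4 and 16 fall short of the 20 amino acids; 64 suffices
    print(k, len(words), len(words) >= 20)
```

With three bases per codon there are 4³ = 64 words, leaving 64 − 21 = 43 spare combinations, which is exactly what produces the degeneracy discussed next.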
The codes in the coding strand of DNA and in messenger RNA are not, of course, identical, because in RNA the base uracil (U) is used instead of thymine (T). Table 1 shows how the various combinations of three bases in the coding strand of DNA code for individual amino acids, shown by their three-letter abbreviations. The table is arranged so that any particular combination is easy to find. The colours stress the fact that most of the amino acids have more than one code. Look, for example, at leucine in the first column: there are six different codons, all of which will eventually produce a leucine (Leu) in the protein chain; there are also six for serine (Ser). In fact, only two amino acids have a single base sequence coding for them: methionine (Met) and tryptophan (Trp). Note that three codons do not code for an amino acid but for "stop" instead; for obvious reasons these are known as stop codons. The stop codons in the RNA table (UAA, UAG and UGA) serve as a signal that the end of the chain has been reached during protein synthesis. The codon that marks the start of a protein chain is AUG, which codes for methionine (Met); that ought to mean that every protein chain must start with methionine.
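The degeneracy pattern Table 1 describes can be verified programmatically from the standard RNA codon table (spelled out here by hand, with amino acids named by their three-letter abbreviations):

```python
from itertools import product
from collections import Counter

# Standard RNA codon table, grouped by amino acid.
genetic_code = {}
def assign(aa, codons):
    for codon in codons.split():
        genetic_code[codon] = aa

assign("Phe", "UUU UUC");         assign("Leu", "UUA UUG CUU CUC CUA CUG")
assign("Ile", "AUU AUC AUA");     assign("Met", "AUG")
assign("Val", "GUU GUC GUA GUG"); assign("Ser", "UCU UCC UCA UCG AGU AGC")
assign("Pro", "CCU CCC CCA CCG"); assign("Thr", "ACU ACC ACA ACG")
assign("Ala", "GCU GCC GCA GCG"); assign("Tyr", "UAU UAC")
assign("Stop", "UAA UAG UGA");    assign("His", "CAU CAC")
assign("Gln", "CAA CAG");         assign("Asn", "AAU AAC")
assign("Lys", "AAA AAG");         assign("Asp", "GAU GAC")
assign("Glu", "GAA GAG");         assign("Cys", "UGU UGC")
assign("Trp", "UGG");             assign("Arg", "CGU CGC CGA CGG AGA AGG")
assign("Gly", "GGU GGC GGA GGG")

# All 64 triplets are covered exactly once, carrying 21 distinct signals.
assert set(genetic_code) == {"".join(t) for t in product("UCAG", repeat=3)}

degeneracy = Counter(genetic_code.values())
print(len(degeneracy))                       # 21 signals (20 amino acids + Stop)
print(degeneracy["Leu"], degeneracy["Ser"],
      degeneracy["Met"], degeneracy["Trp"],
      degeneracy["Stop"])                    # 6 6 1 1 3
print(genetic_code["AUG"])                   # the start codon: Met
```

The counts reproduce the text exactly: six codons each for leucine and serine, one each for methionine and tryptophan, and three stop codons.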
4.2. Physical and information-theoretical properties
An important goal of the present study is to characterize the biological units that codify amino acids by means of information-theoretical properties. To accomplish this, we depict in Figures 9 through 13 the Shannon entropy, the disequilibrium, the Fisher information, and the LMC and FS complexities in position space as the number of electrons increases, for the group of 64 codons. A general observation is that all codons hold similar values for each of these properties, judging by the small range of values in each graph. For instance, the Shannon entropy values for the amino acids (see Figure 2) lie between 4.4 and 5.6, whereas the corresponding values for the codons (see Figure 9) lie between 6.66 and 6.82; this information measure therefore serves to characterize these biological molecules, providing the first benchmark informational results for the building blocks of life. Further, it is interesting to note from Figures 9 and 10 that the entropy increases with the number of electrons (Fig. 9), whereas the opposite behavior is observed for the disequilibrium measure. We may also note in these Figures an interesting codification pattern within each isoelectronic group of codons, in which an exchange of one nucleotide seems to occur; e.g., as the entropy increases in the 440-electron group the following sequence is found: UUU to (UUC, UCU, CUU) to (UCC, CUC, CCU) to CCC. Similar observations can be made in Figures 10 and 11 for D and I, respectively. Fisher information in particular deserves special analysis (see Figure 11): one observes a more intricate behavior in which all codons seem to be linked across the plot, i.e., within each isoelectronic group codons exchange only one nucleotide, e.g., in the 440 group codons change from UUU to (UUC, UCU, CUU) to (UCC, CUC, CCU) to CCC as the Fisher measure decreases.
Moreover, as the Fisher measure and the number of electrons increase linearly, a similar exchange is observed, e.g., from AAA to (AAG, AGA, GAA) to (AGG, GAG, GGA) to GGG. We believe that these observations deserve further study, since a codification pattern seems to be apparent.
In Figures 12 and 13 we depict the LMC and FS complexities, respectively, where we note that as the number of electrons increases the LMC complexity decreases, while the opposite holds for the FS complexity. It is worth mentioning that codification patterns similar to those discussed above are observed for both complexities. Furthermore, we have found it interesting to show similar plots in Figures 14 and 15, where the behavior of both complexities is shown with respect to the total energy. As the energy becomes more negative, the LMC complexity decreases whereas the FS complexity increases. Note that similar codification patterns are observed in Figure 15 for the FS complexity.
5. Concluding remarks
We have shown throughout this Chapter that the fundamental biological pieces of the genetic code, amino acids and codons, can be analysed in a simple fashion through information-theoretical descriptions, employing Information Theory concepts such as local and global information measures and statistical complexity. In particular, we have provided, for the first time in the literature, benchmark information-theoretical values for the 20 essential amino acids and for the 64 nucleotide-triplet codons. Through these studies, we believe that information science may come to constitute a new scientific language for explaining essential aspects of biological phenomena. These aspects are not accessible through any other standard methodology in quantum chemistry, and they reveal intricate mechanisms by which chemical phenomena occur. This envisions a new area of research that looks very promising as a standalone and robust science. The purpose of this research is to provide fertile soil on which to build this nascent area of chemical and biological inquiry through information-theoretical concepts, towards the science of so-called Quantum Information Biology.
We wish to thank José María Pérez-Jordá and Miroslav Kohout for kindly providing their numerical codes. We acknowledge financial support through Mexican grants from CONACyT, PIFI and PROMEP-SEP, and Spanish grants MICINN projects FIS2011-24540, FQM-4643 and P06-FQM-2445 of the Junta de Andalucía. J.A., J.C.A. and R.O.E. belong to the Andalusian research groups FQM-020 and J.S.D. to FQM-0207. R.O.E. wishes to acknowledge financial support from CIE-2012. C.S.C. acknowledges financial support through PAPIIT-DGAPA, UNAM grant IN117311. Allocation of supercomputing time from the Laboratorio de Supercómputo y Visualización at UAM, the Sección de Supercomputación at CSIRC, Universidad de Granada, and the Departamento de Supercómputo at DGSCA-UNAM is gratefully acknowledged.