The experimental molecular archeology analysis using ancestral proteins
The diversity of life on Earth is the result of perpetual evolutionary processes beginning at life’s origins; evolution is the fundamental development strategy of life. Today, studies of gene and protein sequences, including various genome-sequencing projects, provide insight into these evolutionary processes and events. However, the sequence data obtained are restricted to extant genes and proteins, with the exception of rare fossil genome samples [1, 2], for example Neanderthal, an archaic hominin in Siberia [4, 5], and ancient elephants such as mastodon and mammoth. The fossil record, and the genome sequences derived from it, have the potential to elucidate ancient, extinct forms of life, acting as missing links to fill evolutionary gaps; however, the number of sequenced fossil genomes is very limited, mainly because of the condition of the samples and the challenges of preparing them. Discovering the forms of ancient organisms is one of the major purposes of paleontology, and is valuable for understanding current life forms, which reflect their evolutionary history. However, the reconstruction of a living organism from fossils, which would be the ultimate paleontological methodology, is far beyond currently available technologies, although the production of an artificial bacterial cell using a chemically synthesized genome has recently been reported.
Meanwhile, for genes or the proteins they encode, it is already feasible to reconstruct their ancestral forms using phylogenetic trees constructed from sequence data; these techniques may well provide clues to the evolutionary history of certain extant genes and proteins with respect to their ancestors. Although phylogenetic analyses alone, or in combination with protein structure simulations, are useful for analyzing structure-function relationships and evolutionary history, resurrected ancient recombinant proteins have the potential to provide more direct observations. Ancestral or ancient proteins can now be produced comparatively easily owing to developments in molecular biology and protein engineering techniques, which allow nucleotide or amino acid sequences to be synthesized. Ancestral proteins can be tested in the laboratory, using biochemical or biophysical methods, for their activity, stability, specificity, and even three-dimensional structure. Thus, ancestral sequence reconstruction (ASR) has proved a useful experimental tool for studying the diverse structure and function of proteins. To date, such ‘experimental molecular archeology’ using ASR has been applied to a variety of proteins, including enzymes [10-24], photo-reactive proteins [25-37], nuclear receptor and transmembrane proteins [38-48], lectins [49-52], viral proteins [53, 54], elongation factor [55-57], and parvalbumin, in addition to a number of peptides [59, 60] (Table 1).
In early studies, ASR experiments using the technique of molecular phylogeny were based on basic site-directed mutagenesis and were used to investigate the functional evolution of proteins, including the convergent evolution of lysozyme in ruminant stomach environments and the adaptation of enzymes to alkaline conditions [10-12]. However, once ancestral sequences have been determined, the most straightforward method is to reconstruct the full-length ancestral protein in the laboratory. No fundamental differences exist between ancestor reconstruction and standard site-directed mutagenesis, other than the number of amino acid residues requiring mutation, which, in the case of ancestor reconstruction, might be spread over the entire sequence. At present, ASR can be achieved using commercially available
2. The early studies: Reconstruction of partial ancestors by site-directed mutagenesis
The first studies exploiting the idea of ancestral protein reconstruction used site-directed mutagenesis, in which a small number of amino acids were substituted to produce the anticipated ancestral state. These studies include the reconstruction of a ribonuclease (RNase) of an extinct bovid ruminant [10, 11] and of game-bird lysozymes, with the ancestral states predicted by the maximum parsimony (MP) method. Benner and colleagues reconstructed the RNase of an extinct bovid ruminant by predicting four sequences of ancestral RNases from five closely related bovids, including ox, swamp buffalo, river buffalo, nilgai, and the primitive artiodactyl, using the MP method [61, 62]. Of the four probable ancestors, the one closest to the extant ox protein was selected as the target of the experiment, as it contained a mutation at amino acid residue 35, located close to Lys41, which is known to be important for catalysis. Three ancestral mutants of the ox RNase (A19S, L35M, and A19S/L35M) were examined for their kinetic properties and their thermal stabilities against tryptic digestion. However, no significant difference was found between the ancestral mutants and the modern ox RNase. The results suggested that these amino acid substitutions were evolutionarily neutral, although this conclusion is limited to the properties examined.
Malcolm et al. succeeded in identifying a non-neutral evolutionary pathway of game-bird lysozymes using ancestral lysozyme reconstructions predicted by the MP method. Seven mutations in game-bird lysozyme proteins included combinations of residues Thr40, Ile55, and Ser91, which were anticipated to be Ser40, Val55 and Thr91, respectively, in ancestral molecules. The mutants were synthesized as possible intermediates in the evolutionary pathway of bird lysozyme, and comparison of their molecular properties and crystal structures revealed that the thermostabilities of the proteins were correlated with the bulkiness of their side chains. The T40S mutant increased its thermostability by more than 3°C, leading to the conclusion that this mutation was non-neutral, an effect of natural selection.
Yamagishi and colleagues used ancestral protein reconstruction [14-16] to obtain direct evidence for the hypothesis that the common ancestor of all organisms was hyper-thermophilic. Because the catalytic activities of 3-isopropylmalate dehydrogenase (IPMDH) and isocitrate dehydrogenase (ICDH) are similar to one another and their three-dimensional structures are conserved, these proteins are thought to have diverged from an ancient common ancestor, the sequence of which was inferred from a phylogenetic tree constructed from IPMDH and ICDH sequences from various species, including the thermophile (
Recently, Whittington and Moerland reported that ASR analysis of parvalbumins (PVs) was able to identify the set of substitutions most likely to have caused a significant shift in PV function during the evolution of
These studies were performed by introducing a limited number of mutations into extant proteins, or by carefully selecting ancestors that were separated from an extant protein by only a few substitutions. However, such ancestral reconstruction by site-directed mutagenesis appears to be incomplete, as the possibility that sites remaining in a non-ancestral state may significantly affect the molecular property of interest cannot be ruled out. Although it is difficult and expensive to introduce many mutations into sites widely distributed over gene sequences by site-directed mutagenesis,
3. Methods for ancestral sequence prediction
How can we determine the sequences of ancestral proteins or genes? In most cases, since the ancestral genes do not currently exist, the ancestral sequences need to be estimated and reconstructed mainly
By contrast, the ML method, which does not require this assumption, is currently more widely used. This method evaluates the posterior probability of a nucleotide/amino acid residue at each node of a phylogenetic tree, based on empirical Bayesian statistics, using the provided sequences and a substitution probability matrix as inputs (observations). Therefore, results can be significantly affected by the choice of input sequences and the choice of substitution probability matrix; the probability of a reconstructed sequence at a node might be low when the node is connected to the provided sequences through longer and/or more intervening branches. The ML method is popular in the field, largely owing to the excellent software package PAML. Several other software applications have also been developed for this purpose, such as FastML, ANCESCON, and GASP. With the exception of GASP, which partly employs the MP method to enable ancestral state prediction at gapped sites in a sequence alignment, these applications are based on the ML method. In many cases, ancestral sequences cannot be unambiguously determined, and several amino acids might be assigned to a residue site with almost equal probabilities. To avoid false conclusions arising from such ambiguity, the accuracy of the reconstructed ancestral sequence is critical. However, it is often difficult to obtain a complete, highly accurate sequence, as molecular evolution is believed to be a highly stochastic process and there is no guarantee that ancestral sequences can be identified without errors. Even if each residue of a 100-residue protein is identified with a posterior probability of 0.99 (i.e., 99% of the residues are expected to be correct), the probability that the sequence as a whole is accurate is only ~0.37 (i.e., $0.99^{100}$). In many actual cases, site probabilities are likely to be much lower.
This is a major problem in ancestor reconstruction studies, and considerable efforts have been made to avoid incorrect conclusions due to imperfect reconstruction.
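The arithmetic behind the ~0.37 figure can be checked with a few lines of Python. This is a minimal sketch that assumes the sites are statistically independent, so that the whole-sequence probability is the product of the per-site posterior probabilities; the function names are illustrative and do not come from any ASR package.

```python
import math


def whole_sequence_probability(site_posteriors):
    """Probability that every site is correct, assuming independent sites."""
    return math.prod(site_posteriors)


def expected_wrong_sites(site_posteriors):
    """Expected number of incorrectly reconstructed sites."""
    return sum(1.0 - p for p in site_posteriors)


# The example from the text: 100 residues, each with posterior probability 0.99.
posteriors = [0.99] * 100
p_all_correct = whole_sequence_probability(posteriors)
n_wrong = expected_wrong_sites(posteriors)

print(f"P(all sites correct) = {p_all_correct:.3f}")  # 0.366
print(f"Expected wrong sites = {n_wrong:.1f}")         # 1.0
```

Under the same assumption, lowering the per-site probability or lengthening the protein reduces the whole-sequence probability geometrically, which is why ambiguous sites are usually examined individually.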
Williams et al. reported an assessment of the accuracy of ancestral protein reconstruction by the MP, ML and Bayesian inference (BI) methods. Their results indicated that the MP and ML methods, which reconstruct the "best guess" amino acid at each position, overestimate thermostability, while the BI method, which sometimes chooses less-probable residues from the posterior probability distribution, does not. ML and MP tend to eliminate variants at a position that are slightly detrimental to structural stability, simply because such detrimental variants are less frequent. Thus, Williams et al. caution that ancestral reconstruction studies require greater care to reach credible conclusions regarding functional evolution. Thornton and colleagues also used simulation-based experiments, under both simplified and empirically derived conditions, to compare the accuracy of ASR carried out using ML and Bayesian approaches. They showed that incorporating phylogenetic uncertainty by integrating over topologies very rarely changes the inferred ancestral state and does not improve the accuracy of the reconstructed ancestral sequence, suggesting that ML can produce accurate ASRs even in the face of phylogenetic uncertainty, and that using Bayesian integration to incorporate the uncertainty is neither necessary nor beneficial.
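The difference between a "best guess" reconstruction and one sampled from the posterior distribution can be illustrated as follows. This is a sketch only: the per-site posterior probabilities are hypothetical values chosen for illustration, not output from PAML or any other program, and the sampling step is a simplified stand-in for the BI procedure discussed above.

```python
import random

# Hypothetical posterior distributions for three sites of an ancestral node.
site_posteriors = [
    {"A": 0.97, "S": 0.03},
    {"L": 0.55, "M": 0.45},
    {"K": 0.40, "R": 0.35, "T": 0.25},
]


def most_probable_sequence(posteriors):
    """'Best guess' reconstruction: take the most probable residue at each site."""
    return "".join(max(dist, key=dist.get) for dist in posteriors)


def sample_sequence(posteriors, rng):
    """Draw one residue per site in proportion to its posterior probability."""
    residues = []
    for dist in posteriors:
        choice = rng.choices(list(dist.keys()), weights=list(dist.values()), k=1)[0]
        residues.append(choice)
    return "".join(residues)


rng = random.Random(0)  # fixed seed so the run is repeatable
ml_seq = most_probable_sequence(site_posteriors)
samples = [sample_sequence(site_posteriors, rng) for _ in range(1000)]

print("Best-guess sequence:", ml_seq)  # ALK
print("Fraction of samples equal to best guess:",
      sum(s == ml_seq for s in samples) / len(samples))
```

The best-guess sequence never contains the less frequent residues, whereas the sampled sequences include them at roughly their posterior frequencies, which is the behavior that distinguishes the two approaches in the comparison above.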
In experimental molecular archeology using ASR, the effects of equally probable residues at unreliable sites have been tested by site-directed mutagenesis, to confirm directly that the molecular properties are not substantially affected by them. Indeed, in the case of the ancestral congerin genes, the single mutant Con-anc’-N28K, in which the ambiguous site was replaced with the alternative candidate amino acid, was reconstructed in addition to the ancestral congerin (Con-anc’, the last common ancestor of ConI and ConII) inferred from the phylogeny of extant galectins using the ML method based on DNA sequences. Nucleotide sequences were retrieved from the DDBJ database, and the ancestral sequences were inferred using the PAML program. The alignment of amino acid sequences of the extant galectins was first prepared using the XCED program, and an alignment of the corresponding nucleotide sequences was made in accordance with the amino acid sequence alignment. Tree topology was based on the amino acid sequences of extant proteins using the neighbor-joining (NJ) method. PAML was applied to the phylogeny and alignment to infer the ancestral sequences. The F1X4 matrix was used as the codon substitution model with the universal codon table. The free
In the case of the alcohol dehydrogenase (Adh) ancestral mutants reported by Thomson et al., the posterior probability of the sequence predicted by the ML method was found to be low at three sites. Amino acid residues 168, 211 and 236 of Adh had two (Met and Arg), three (Lys, Arg and Thr), and two (Asp and Asn) equally probable candidates as the ancestral residues, respectively. Therefore, all possible combinations (2 × 3 × 2 = 12) of the candidates at the ambiguous sites were produced, and their kinetic properties assessed. The results were consistent among the alternative mutants and confirmed that acetaldehyde metabolism was the original function of Adh, that ancestral yeast could not consume ethanol, and that the function of ethanol metabolism was most likely acquired in the lineage of the Adh2 locus after gene duplication.
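The enumeration of the twelve Adh variants can be sketched in Python. The site numbers and candidate residues are those given in the text (in one-letter code); the function name and data layout are illustrative assumptions.

```python
from itertools import product

# Candidate ancestral residues at the three ambiguous Adh sites (one-letter code):
# 168: Met/Arg, 211: Lys/Arg/Thr, 236: Asp/Asn.
ambiguous_sites = {
    168: ["M", "R"],
    211: ["K", "R", "T"],
    236: ["D", "N"],
}


def enumerate_variants(sites):
    """Return every combination of candidate residues as {position: residue}."""
    positions = sorted(sites)
    candidate_lists = [sites[pos] for pos in positions]
    return [dict(zip(positions, combo)) for combo in product(*candidate_lists)]


variants = enumerate_variants(ambiguous_sites)
print(len(variants))  # 12
for variant in variants:
    # e.g. "168M/211K/236D"
    print("/".join(f"{pos}{aa}" for pos, aa in variant.items()))
```

The number of variants grows multiplicatively with each additional ambiguous site, so exhaustive construction is practical only when few sites are uncertain.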
4. Reconstruction of full-length ancestral proteins: Selective adaptive evolution of Conger eel galectins
Conger eel galectins, termed Congerins I and II (Con I and Con II), function as biodefense molecules in the skin mucus and frontier organs including the epidermal club cells of the skin, wall of the oral cavity, pharynx, esophagus, and gills [76-79]. Con I and Con II are prototype galectins, composed of subunits containing 135 and 136 amino acids, respectively, and display 48% amino acid sequence identity. While both Con I and Con II form 2-fold symmetric homodimers with 5- and 6-stranded β-sheets (termed a jellyroll motif), and both have the conserved carbohydrate recognition domain (CRD) common to other galectins, they differ in stability and carbohydrate-binding specificity [81-84]. Previous studies of Con I and Con II, based on molecular evolutionary and X-ray crystallography analyses, revealed that these proteins have evolved via accelerated substitutions under natural selection pressure [74-85].
To understand the rapid adaptive differentiation of congerins, experimental molecular archeology analysis using the reconstructed ancestral congerins, Con-anc and Con-anc’, and their mutants has been conducted [49-51]. Since the ancestral sequences of congerin, Con-anc and Con-anc’, were estimated from different phylogenetic trees, which were constructed from different numbers of available extant genes (eight for Con-anc, and sixteen for Con-anc’) (Fig. 1A), the ancestral sequence Con-anc’ showed a 27% discrepancy from the previously inferred sequence of Con-anc (Fig. 1B). Furthermore, as described in the ‘Methods for Ancestral Sequence Prediction’ section, the reproduction rate of each Con-anc’ amino acid residue was examined across sequences reconstructed with one extant gene excluded for each estimation, in order to identify highly unstable sites. The results indicated that the average reproduction rate over the sequence was 0.98, and only one residue, Asn28 of Con-anc’, was reproduced with a distinguishably low rate of 0.286, prompting verification of the results by the construction of the single mutant Con-anc’-N28K. The revised ancestral congerins, Con-anc’ and Con-anc’-N28K, were attached to the nodes of extant proteins with zero distance in the phylogeny constructed from amino acid sequences, indicating that the sequence was appropriate for that of an ancestor (Fig. 1A). On the other hand, the previously inferred Con-anc was attached midway on the ConI branch. Therefore, Con-anc’ and Con-anc’-N28K are likely to be closer to the true common ancestor of ConI and ConII than Con-anc. The structures and molecular properties of congerins, as discussed below, also supported this conclusion.
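The leave-one-out check described above can be sketched as follows, assuming the ancestral sequence has already been re-inferred once per excluded extant gene by an external program. The short sequences below are toy data, not congerin sequences, and the function names and the 0.5 threshold are illustrative assumptions.

```python
def reproduction_rates(reference, replicates):
    """Per-site fraction of leave-one-out reconstructions matching the reference.

    reference  -- ancestral sequence inferred from the full data set
    replicates -- ancestral sequences inferred with one extant gene excluded each
    """
    if not replicates:
        raise ValueError("at least one replicate sequence is required")
    if any(len(rep) != len(reference) for rep in replicates):
        raise ValueError("all sequences must be aligned to the same length")
    n = len(replicates)
    return [
        sum(rep[i] == ref_aa for rep in replicates) / n
        for i, ref_aa in enumerate(reference)
    ]


def unstable_sites(rates, threshold=0.5):
    """1-based positions whose reproduction rate falls below the threshold."""
    return [i + 1 for i, rate in enumerate(rates) if rate < threshold]


# Toy data: the third residue is reproduced in only one of four replicates.
reference = "MSNGK"
replicates = ["MSNGK", "MSKGK", "MSKGK", "MSKGK"]

rates = reproduction_rates(reference, replicates)
print(rates)                  # [1.0, 1.0, 0.25, 1.0, 1.0]
print(unstable_sites(rates))  # [3]
```

A site flagged in this way is a natural candidate for an alternative mutant, as was done for Asn28 with Con-anc’-N28K.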
Although Con-anc is an ancestral mutant located midway on the ConI branch and shares a higher sequence similarity with ConI (76%) than with ConII (61%), it showed unique carbohydrate-binding activity and properties, and more closely resembled ConII than ConI in terms of thermostability and carbohydrate recognition specificity, with the exception of carbohydrates containing α2,3-sialyl galactose, for example GM3 and GD1a. The ancestral congerins, Con-anc’ and Con-anc’-N28K, demonstrated carbohydrate-binding activity and specificities similar to those of Con-anc. These analyses of Con-anc suggested a functional evolutionary process for ConI, in which it evolved from the ancestral congerin to increase its structural stability and sugar-binding activity. Using the ancestral congerin Con-anc, the candidate amino acid residues responsible for the higher structural stability and carbohydrate-binding activity of Con I were narrowed down to only 31, from a total of 71 residues that differ between Con I and Con II. These were mainly located in the N- and C-terminal and loop regions of the molecule, including the CRD [49, 50]. To identify the residues responsible for the properties of Con I, we next performed molecular evolution tracing analysis, by constructing pseudo-ancestral Con-anc proteins focused on the N-terminal, C-terminal, and some loop regions (loops 3, 5 and 6).
This is a protein engineering approach in which a proportion of the amino acid residues of an extant protein are substituted with those of an ancestor to construct pseudo-ancestors, in order to reveal the residues determining functional differences between extant and ancestral proteins. These molecular evolutionary approaches, using pseudo-ancestors bridging Con-anc and ConI, successfully elucidated the regions of the protein relevant to the two adaptive features of ConI, thermostability and higher carbohydrate-binding activity. Experimental molecular archeology analysis using the reconstructed ancestral congerins also revealed the process of evolution of ConII, the other extant congerin. ConII has evolved to enhance its affinity for α2,3-sialyl galactose, which is specifically present in pathogenic marine bacteria. The selection pressure to which Con II reacted was hypothesized to be a shift in carbohydrate affinity. The observed difference in α2,3-sialyl galactose affinities between Con-anc and Con II supports this hypothesis.
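The construction of pseudo-ancestors described at the start of this paragraph can be sketched in Python. This is an illustrative sketch only: the two sequences are short toy strings rather than congerin sequences, the region boundaries are hypothetical, and the sequences are assumed to be aligned without gaps.

```python
def make_pseudo_ancestor(extant, ancestor, regions):
    """Replace the given regions of an extant sequence with ancestral residues.

    regions -- list of (start, end) pairs, 1-based and inclusive
    """
    if len(extant) != len(ancestor):
        raise ValueError("sequences must be aligned to the same length")
    residues = list(extant)
    for start, end in regions:
        residues[start - 1:end] = ancestor[start - 1:end]
    return "".join(residues)


def list_substitutions(extant, variant):
    """Substitutions relative to the extant protein, e.g. 'S2A'."""
    return [
        f"{old}{pos}{new}"
        for pos, (old, new) in enumerate(zip(extant, variant), start=1)
        if old != new
    ]


# Toy sequences and hypothetical region boundaries.
extant = "MSGGLQVKNF"
ancestor = "MAGGLEVKDF"
regions = {"N-terminal": [(1, 3)], "loop": [(5, 7)], "C-terminal": [(8, 10)]}

for name, spans in regions.items():
    pseudo = make_pseudo_ancestor(extant, ancestor, spans)
    print(name, pseudo, list_substitutions(extant, pseudo))
# N-terminal MAGGLQVKNF ['S2A']
# loop MSGGLEVKNF ['Q6E']
# C-terminal MSGGLQVKDF ['N9D']
```

Comparing the measured properties of each pseudo-ancestor with those of the extant and ancestral proteins then indicates which region carries the residues responsible for a given functional difference.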
The crystal structures of the full-length ancestral proteins Con-anc’, Con-anc’-N28K and Con-anc have been solved at 1.5, 1.6, and 2.0 Å resolution, respectively. Their three-dimensional (3D) structures clearly demonstrate that Con-anc’ and Con-anc’-N28K are appropriate ancestors of extant congerins (Fig. 2). A notable difference between the structures of ConI and ConII is the swapping of S1 strands at the dimer interface, which is unique to ConI among known galectins and should contribute to its higher stability. The dimer interface of the ancestral Con-anc’ and Con-anc’-N28K resembled that of ConII and lacked the strand swap. This protein fold is the prototype for dimeric galectins, and the congerin ancestor is expected to have had a ConII-like conformation. Conversely, Con-anc did display a strand-swapped structure, indicating that it was more likely to be an intermediate between the ancestor and ConI, consistent with the results of phylogeny construction (Fig. 2). The differences in carbohydrate interactions between Con-anc’ and the extant congerins were observed mainly at the A-face of galactose. These modifications might be relevant to the observed differentiation of carbohydrate specificities between ConI and ConII; ConI prefers α1,4-fucosylated
Taken together, the first full-length ancestral structures of congerin revealed that the duplicated genes have been differentiating under natural selection pressures for strengthening of the dimer structure and enhancement of the cytotoxic activity. However, the two genes did not react equally to selection pressure, with ConI reacting through protein-fold evolution to enhance its stability. The modification of the dimer interface in the ConII lineage was rather moderate.
5. Reconstruction of ancestral proteins: thermal adaptation of proteins in thermophilic bacterium
Ancestral mutant analysis has been performed to explore the thermal adaptation of proteins. Benner and colleagues reconstructed the ancestral elongation factor-Tu (EF-Tu) predicted using ML methodology, in order to infer the physical environment surrounding ancient organisms. Because EFs play a crucial role in protein synthesis in cells, the thermostability of EFs shows a strong correlation with the optimal growth temperature of their host organisms. For example, the melting temperatures (
Yamagishi and coworkers reported several ancestral proteins, including two metabolic enzymes: 3-isopropylmalate dehydrogenase (IPMDH), which is involved in leucine biosynthesis, and isocitrate dehydrogenase (ICDH), which is involved in the TCA cycle. Ancestral amino acids were introduced into extant IPMDH sequences of the hyperthermophilic archaeon
More recently, Hobbs et al. reported the reconstruction of several common Precambrian ancestors of the core metabolic enzyme LeuB, 3-isopropylmalate dehydrogenase, estimated from various
Overall, these studies demonstrate that ancestral enzymes retained enzymatic activity and acquired enhanced thermostability over respective extant enzymes, and that introduction of ancestral state amino acids into modern proteins frequently thermostabilizes them. This indicates that ancestral protein reconstruction can provide empirical access to the evolution of ancient phenotypes, and is useful as a strategy for thermostabilization protein engineering.
6. Reconstruction of ancestral proteins: Evolutionary history of nuclear receptors and visual pigment proteins
Thornton and colleagues have reported seminal work using ancestral protein reconstructions of the nuclear receptors for steroid hormones to investigate the evolution of their ligand specificities [38-47, 88, 89]. Vertebrates have six homologous nuclear receptors for steroid hormones: the estrogen receptors alpha and beta (ERα and ERβ), androgen receptor (AR), progesterone receptor (PR), glucocorticoid receptor (GR), and mineralocorticoid receptor (MR). As these proteins are thought to have evolved from a common ancestor through a series of gene duplications, the reconstruction of their ancestral proteins is a useful tool for investigating the evolution of their ligand specificity. Although GR and MR are close relatives, GR is activated only by the stress hormone cortisol in most vertebrates, while MR is activated by both aldosterone and cortisol [90, 91]. The amino acid substitutions responsible for the specificity of GR toward cortisol were identified by reconstruction studies of the common ancestor of GRs and MRs using ML methodology [38-47]. Thornton and colleagues also reconstructed the ancestral corticoid receptor (AncCR), which corresponds to the protein predicted to have existed at the duplication event between the GR and MR genes. Functional analysis showed that AncCR could be activated by both aldosterone and cortisol, suggesting that the GR of vertebrates lost its aldosterone specificity during the evolutionary process. Furthermore, site-directed mutagenesis and X-ray crystallographic studies of AncCR revealed that the amino acid substitutions S106P and L111Q were key for the specificity shift of GR [38, 39]. AncCR is the first complete-domain ancestor (ligand-binding domain only) for which a 3D structure was determined. Ancestral mutant analysis of the NR5 nuclear orphan receptors, including steroidogenic factor 1 (SF-1) and liver receptor homolog 1 (LRH-1), was also reported.
The structure-function relationships of the SF-1/LRH-1 subfamily were elucidated, together with an evolutionary shift in ligand binding: the characteristic phospholipid-binding ability of the subfamily was reduced and then lost in the lineage leading to rodent LRH-1, owing to specific amino acid replacements.
Reconstruction of visual pigment proteins, including rhodopsin and green fluorescent protein (GFP)-like proteins, has also been conducted. Chang et al. reconstructed an ancestral archosaur rhodopsin from thirty vertebrate species using the ML method and three different models: nucleotide-, amino acid-, and codon-based. The archosaur rhodopsin inferred with each of these models had the same amino acid sequence, except at three sites (positions 213, 217, and 218). The reconstructed ancestral protein and its four variants at the ambiguous sites (the single mutants T213I, T217A, and V218I, and the triple mutant combining them) showed similar optical properties, with an apparent absorption maximum at 508–509 nm, slightly red-shifted from that of modern vertebrates (482–507 nm). These data indicated that the alternative ancestral amino acids predicted by the different likelihood models showed similar functional characteristics. Dim-light and color vision in vertebrates are controlled by five visual pigments (RH1, RH2, SWS1, SWS2, and M/LWS), each consisting of a protein moiety (opsin) and a covalently bound 11-cis-retinal (or 11-
The great star coral
7. Concluding remarks
Experimental molecular archeology using ASR is a new and potentially useful method not only for the study of molecular evolution, but also as a protein engineering technique. This method can provide us with experimental information about ancient genes and proteins, which cannot be obtained from fossil records or by simply constructing molecular phylogeny. However, as discussed above, ancestral sequences can be ambiguous, depending on the choice of evaluation method, evolutionary model, and sequences. Inference methods such as MP, ML and BI can lead to errors in predicted ancestral sequences, resulting in potentially misleading estimates of the properties of the ancestral protein. Experimental molecular archeology using ASR can mitigate this risk, because all possible ancestral mutants, in which ambiguous amino acid sites are replaced by equally probable candidates individually or in combination, can be produced, and the biological and physicochemical properties and 3D structures of the molecules can be assessed. Indeed, when ancestral congerins were reconstructed from insufficient sequence information, lacking the recently determined fish galectin genes, the ancestral Con-anc protein was shown to have a strand-swapped structure resembling ConI, indicating that Con-anc was more likely to be an intermediate between the ancestor and ConI, and that the revised Con-anc’ and Con-anc’-N28K are more appropriate ancestors. Thus, the accuracy of ASR can be assessed by analysis of protein activities, stabilities, specificities, and even 3D structures in the laboratory using biochemical or biophysical methods.
Experimental molecular archeology using ASR can be applied to more complex biological systems, such as heterologous subunit interactions and their evolution in molecular machines, host-viral interactions and their co-evolution [54, 94, 95], and proteome/structural proteome level analyses. Furthermore, recent studies have indicated that ASR is applicable not only to proteins, but also to nucleotides, including ancestral rRNA and transposons. To understand the molecular strategies of evolution in nature and the structure-function relationships of proteins and nucleotides, it is important to learn more from ‘nature’ itself, and from its prodigious works and histories: proteins/nucleotides and their molecular evolution.