Bioinformatics has its origins in the development of DNA sequencing methods by Alan Maxam and Walter Gilbert (Maxam and Gilbert, 1977), and by Frederick Sanger and coworkers (Sanger et al., 1977). Using these entirely different approaches, the first genomes determined at the nucleotide sequence level were those of bacteriophage ϕX174 and the recombinant plasmid named pBR322, composed of about 5,400 (Sanger et al., 1977) and 4,400 base pairs (Sutcliffe, 1979), respectively. In contrast, two articles that appeared in February 2001 reported on the preliminary DNA sequence of the human genome, which corresponds to 3 billion nucleotides of DNA sequence information (Lander et al., 2001; Venter et al., 2001). Only two years later, the GenBank sequence database contained more than 29.3 billion nucleotide bases in greater than 23 million sequences. With the development of new technologies, experts predict that the cost to sequence an individual’s DNA will fall to about $1,000. This reduction in cost suggests that efforts in the area of comparative genomics will increase substantially, leading to an enormous database that vastly exceeds the existing one.
Through comparative genomics approaches, computational methods have led to the identification of homologous genes shared among species, and their classification into superfamilies based on amino acid sequence similarity. In combination with their evolutionary relatedness, superfamily members have been clustered into clades. In addition, high throughput sequencing of small RNAs and bioinformatics analyses have contributed to the identification of regions between genes that can code for small RNAs (siRNA, microRNA, and long noncoding RNA), which act during the development of an organism to modulate gene expression at the post-transcriptional level (Fire et al., 1998; Hamilton and Baulcombe, 1999; reviewed in Elbashir et al., 2001; Ghildiyal and Zamore, 2009; Christensen et al., 2010). An emerging area is functional genomics, whereby gene function is deduced using large-scale methods by identifying the involvement of specific genes in metabolic pathways. More recently, phenotype microarray methods have been used to correlate the functions of genes of microbes with cell phenotypes under a variety of growth conditions (Bochner, 2009).
These methods contrast with the traditional approach of mapping a gene via the phenotype of a mutation, and deducing the function of the gene product based on its biochemical analysis in concert with physiological studies. Such studies have been performed to confirm the functional importance of conserved residues shared by superfamily members, and also to determine the role of specific residues for a given protein. In comparison, comparative genomics methods cannot distinguish whether a nonconserved amino acid among superfamily members is functionally important or simply reflects sequence divergence due to the absence of selection during evolution. That determination requires functional information.
2. Bioinformatics analysis of AAA+ proteins
On the basis of bioinformatics analysis, P-loop nucleotide hydrolases compose a very large group of proteins that use an amino acid motif named the phosphate binding loop (P-loop) to hydrolyze the phosphate ester bonds of nucleotides. A positively charged group in the side chain of an amino acid (often lysine) in the P-loop promotes nucleotide hydrolysis by interacting with the phosphate of the bound nucleotide. Additional bioinformatics analysis of this group of proteins led to a category of nucleotidases containing the Walker A and B motifs, as well as additional motifs shared by the AAA (ATPases Associated with diverse cellular Activities) superfamily (Beyer, 1997; Swaffield and Purugganan, 1997). These diverse activities include protein unfolding and degradation, vesicle transport and membrane fusion, transcription and DNA replication. The additional motifs of the AAA superfamily differentiate its members from the larger set of P-loop nucleotidases. Neuwald et al., and Iyer et al. then integrated structural information with bioinformatics analysis to classify members of the AAA+ superfamily into clades (Neuwald et al., 1999; Iyer et al., 2004). These clades are the clamp loader clade, the DnaA/CDC6/ORC clade, the classical AAA clade, the HslU/ClpX/Lon/ClpAB-C clade, and the Helix-2 insert clade. The last two clades have been organized into the Pre-sensor 1 hairpin superclade.
Members of the superfamily of AAA+ ATPases carry a nucleotide-binding pocket called the AAA+ domain that ranges from 200 to 250 amino acids, which is formed by an αβα-type Rossmann fold followed by several α helices (Figure 1) (Lupas and Martin, 2002; Iyer et al., 2004; Hanson and Whiteheart, 2005). Such proteins often assemble into ring-shaped or helical oligomers (Davey et al., 2002; Iyer et al., 2004; Erzberger and Berger, 2006). Using the nomenclature of Iyer et al., the Rossmann fold is formed by a β sheet of parallel strands arranged in a β5-β1-β4-β3-β2 series. Its structure resembles a wedge. An α helix preceding the β1 strand and a loop situated across the face of the β sheet are distinguishing features of the AAA+ superfamily. Another characteristic is the placement of several α helices above the wide end of the wedge. The P-loop or Walker A motif (GXXXXGK[T/S], where X is any amino acid) is located between the β1 strand and the following α helix. The Walker B motif (ϕϕϕϕDE, where ϕ is a hydrophobic amino acid) coordinates, via the conserved aspartate residue, a magnesium ion complexed with the nucleoside triphosphate. The conserved glutamate is thought to interact with a water molecule to make it a better nucleophile for nucleotide hydrolysis.
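As an illustration only, the Walker A and Walker B motif definitions given above can be written as regular expressions and used to scan a protein sequence. The set of residues treated as hydrophobic and the toy sequence are assumptions made for this sketch; they are not taken from the cited studies, and real motif searches typically rely on profile-based methods rather than exact patterns.

```python
import re

# Walker A (P-loop): G, any four residues, then G-K and either T or S.
WALKER_A = re.compile(r"G.{4}GK[TS]")

# Walker B: four hydrophobic residues followed by D-E.
# ASSUMPTION: this hydrophobic set is one common choice; definitions vary.
HYDROPHOBIC = "AVILMFWYC"
WALKER_B = re.compile(rf"[{HYDROPHOBIC}]{{4}}DE")


def find_motifs(protein):
    """Return 0-based (start, matched_text) hits for each motif."""
    return {
        "walker_a": [(m.start(), m.group()) for m in WALKER_A.finditer(protein)],
        "walker_b": [(m.start(), m.group()) for m in WALKER_B.finditer(protein)],
    }


# Hypothetical toy sequence, NOT a real protein.
toy_protein = "MSTLGPPGTGKTHLAAALGNELIKRGVLVIFDEVGR"
hits = find_motifs(toy_protein)
print(hits)
```

A pattern match of this kind only flags candidate motifs; as the text emphasizes, whether a candidate residue is functionally important must be established experimentally.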
AAA+ proteins also share conserved motifs named the Sensor 1, Box VII, and Sensor 2 motifs that coordinate ATP hydrolysis with a change in conformation (Figure 1) (Lupas and Martin, 2002; Iyer et al., 2004; Hanson and Whiteheart, 2005). Relative to the primary amino acid sequence, these motifs are on the C-terminal side of the Walker B motif. The Sensor 1 motif contains a polar amino acid at the end of the β4 strand. On the basis of the X-ray crystal structure of N-ethylmaleimide-sensitive factor, an ATPase involved in intracellular vesicle fusion (Beyer, 1997; Swaffield and Purugganan, 1997), this amino acid together with the acidic residue in the Walker B motif interacts with and aligns the activated water molecule during nucleotide hydrolysis. The Box VII motif, which is also called the SRH (Second Region of Homology) motif, contains an arginine named the arginine finger because its function is analogous to that of the corresponding arginine of GTPase activator proteins, which interacts with GTP bound to a small G protein partner to promote GTP hydrolysis. The crystal structures of several AAA+ proteins have shown that the Box VII motif in an individual molecule is located some distance away from the nucleotide binding pocket. In AAA+ proteins that assemble into ring-shaped or helical oligomers, the Box VII motif of one protomer directs an arginine residue responsible for interaction with the γ phosphate of ATP toward the ATP binding pocket of the neighboring molecule. It is proposed that this interaction or lack thereof coordinates ATP hydrolysis with a conformational change. The Sensor 2 motif, which resides in one of the α helices that follow the Rossmann fold, also contains a conserved arginine. For proteins whose structures contain the bound nucleoside triphosphate or a nucleotide analogue, this amino acid interacts with the γ phosphate of the nucleotide. As reviewed by Ogura et al. (2004), this residue is involved in ATP binding or its hydrolysis in some but not all AAA+ proteins. Like the arginine finger residue, this arginine is thought to coordinate a change in protein conformation with nucleotide hydrolysis.
Because this chapter focuses on members of the DnaA/CDC6/ORC or initiator clade, the following summarizes properties of this clade and not others. Like the clamp loader clade, proteins in the initiator clade as represented by DnaA and DnaC have a structure resembling an open spiral on the basis of X-ray crystallography (Erzberger et al., 2006; Mott et al., 2008). In comparison, oligomeric proteins in the remaining clades form closed rings. A characteristic feature of proteins in the initiator clade is the presence of two α helices between the β2 and β3 strands (Figure 1). Compared with the function of DnaA in the initiation of E. coli DNA replication, DnaC plays a separate role. Their functions are described in more detail below. The ORC/CDC6 group of eukaryotic proteins in the initiator clade, like DnaA and DnaC, acts to recruit the replicative helicase to replication origins at the stage of initiation of DNA replication (Lee and Bell, 2000; Liu et al., 2000). The origin recognition complex (ORC) is composed of six related proteins named Orc1p through Orc6p, and likely originated along with Cdc6p from a common ancestral gene.
Bioinformatics analysis of DnaC suggests that this protein is a paralog of DnaA, arising by gene duplication and then diverging with time to perform a separate role from DnaA during the initiation of DNA replication (Koonin, 1992). This notion leads to the question of what specific amino acids are responsible for the different functions of DnaA and DnaC despite the shared presence of the AAA+ amino acid sequence motifs. Presumably, specific amino acids that are not conserved between these two proteins have critical roles in determining their different functions, but how are these residues identified and distinguished from those that are not functionally important? In addition, some amino acids that are conserved among homologous DnaC proteins, which were identified by multiple sequence alignment of twenty-eight homologues (Figure 2), are presumably responsible for the unique activities of DnaC, but what are these unique activities? These issues underscore the limitation of deducing the biological function of a protein by relying only on bioinformatics analysis.
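To make concrete what identifying conserved residues from a multiple sequence alignment involves, the following minimal sketch scores each alignment column by the fraction of sequences that share its most common residue. The scoring rule, the function names, and the three-sequence toy alignment are assumptions for illustration; they do not reproduce the analysis of the twenty-eight DnaC homologues.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue in each column.

    Gap characters ('-') are ignored when picking the most common residue
    but still count in the denominator, so gapped columns score lower.
    """
    n = len(alignment)
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / n)
    return scores


def conserved_positions(alignment, threshold=1.0):
    """0-based columns whose conservation score meets the threshold."""
    scores = column_conservation(alignment)
    return [i for i, score in enumerate(scores) if score >= threshold]


# Hypothetical toy alignment, NOT real DnaC sequences.
toy_alignment = [
    "GKTHLA",
    "GKSHLA",
    "GKT-LV",
]
print(conserved_positions(toy_alignment))
```

Such a score identifies which positions are invariant, but, as noted above, it cannot say what activity a conserved residue supports or whether a variable residue matters.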
3. Reverse genetics as an approach to identify the function of an unknown gene
If the various amino acid sequence alignment methods fail to detect homology between a particular gene and a gene of known function, the function of that gene remains unknown. In such cases, the general approach is to employ reverse genetics to attempt to correlate a phenotype with a mutation in the gene. By way of comparison, forward genetics begins with a phenotype caused by a specific mutation at an unknown site in the genome. The approximate position of the gene can be determined by classical genetic methods that involve its linkage to another mutation that gives rise to a separate phenotype. Refined linkage mapping can localize the gene of interest, followed by PCR (polymerase chain reaction) amplification of the region and DNA sequence analysis to determine the nature of the mutation. As a recent development, whole genome sequencing has been performed to map mutations, dispensing with the classical method of genetic linkage mapping (Lupski et al., 2010; Ng and Kirkness, 2010). The DNA sequence obtained may reveal that the gene and the corresponding gene product have been characterized in the same or a different organism, and thereby disclose its physiological function.
In a reverse genetics approach with a haploid organism, the standard strategy is to inactivate the gene with the hope that a phenotype can be measured. Inactivation can be achieved either by deleting the gene or by insertional mutagenesis, usually with a transposon. As examples, transposon mutagenesis has been performed with numerous microbial species, and with Caenorhabditis elegans (Vidan and Snyder, 2001; Moerman and Barstead, 2008; Reznikoff and Winterberg, 2008). Using E. coli or S. cerevisiae as model organisms for gene disruption, one method relies on replacing most of the gene with a drug resistance cassette, or a gene that causes a detectable phenotype. The technique of gene disruption relies on homologous recombination in which the drug resistance gene, for example, has been joined to DNA sequences that are homologous to the ends of the target gene (Figure 3). After introduction of this DNA into the cell, recombination between the ends of the introduced DNA and the homologous regions in the chromosome leads to replacement of the chromosomal copy of the gene with the drug resistance cassette, after which the excised copy of the chromosomal gene is lost. In both E. coli and S. cerevisiae, this approach has been used in seeking to correlate a phenotype with genes of unknown function, and to identify those that are essential for viability (Winzeler et al., 1999; Baba et al., 2006). After either gene disruption or transposon mutagenesis, the mutation can be mapped by inverse PCR, which uses primers complementary to sequences near the ends of the drug resistance cassette or the transposon. This approach first involves partially digesting the chromosomal DNA with a restriction enzyme, followed by ligation of the resulting fragments to form a collection of circular DNAs. DNA sequence analysis of the DNA amplified with these primers, followed by comparison of the nucleotide sequence with the genomic DNA sequence, can identify the site of the disrupted gene, or the site of insertion of the transposon.
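The outcome of the gene replacement step can be pictured as a simple string operation: the segment of the chromosome lying between two homology arms is exchanged for the cassette. This is only a schematic sketch under stated assumptions. The function name, the toy sequences, and the lowercase placeholder standing in for a drug resistance cassette are all hypothetical, and the model ignores strand orientation, recombination efficiency, and selection.

```python
def replace_by_homology(chromosome, left_arm, right_arm, cassette):
    """Swap the chromosomal segment between two homology arms for a cassette.

    Returns the recombinant sequence, or None if either arm is not found
    (in which case no recombination is modeled).
    """
    start = chromosome.find(left_arm)
    if start == -1:
        return None
    inner_start = start + len(left_arm)
    inner_end = chromosome.find(right_arm, inner_start)
    if inner_end == -1:
        return None
    # Keep everything up to and including the left arm, insert the cassette,
    # then resume at the right arm.
    return chromosome[:inner_start] + cassette + chromosome[inner_end:]


# Hypothetical toy sequences; the arms match the two ends of the target gene.
target_gene = "ATGAAACCCGGGTTTTAA"
toy_chromosome = "CCGG" + target_gene + "GGCC"
toy_cassette = "kkkkkk"  # placeholder for a drug resistance cassette

recombinant = replace_by_homology(toy_chromosome, "ATGAAA", "TTTTAA", toy_cassette)
print(recombinant)
```

In this toy case the interior of the gene is lost while its ends remain as the sites of recombination, consistent with the description that most, not all, of the gene is replaced.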
With a multicellular organism, a similar strategy that relies on homologous recombination is used to delete a gene. The type of cell used to introduce the deletion is usually an embryonic stem cell so that the effect of the deletion can be measured in the whole organism. Many eukaryotic organisms have two complete sets of chromosomes. Because the process of homologous recombination introduces the deletion mutation in one of the two homologous chromosomes, yielding a heterozygote, the presence of the wild type copy on the other homologous chromosome may conceal the biological effect of the deletion. Thus, the ideal objective is to delete both copies of a gene in order to measure the associated phenotype. To attempt to obtain an organism in which both copies of a gene have been “knocked out,” the standard strategy is to mate heterozygous individuals. By Mendelian genetics, one-fourth of the progeny should carry the deletion in both copies of the gene. The drawback with the approach of deleting a gene is that it may be essential for viability, as suggested if a homozygous knockout organism cannot be obtained. Another pitfall is that it may not be possible to construct a heterozygous knockout because the single wild type copy is insufficient to maintain viability. In either case, no other hint of gene function is obtained except for the requirement for life.
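The one-fourth expectation follows from enumerating the equally likely allele combinations in a cross between two heterozygotes. The short sketch below does this enumeration; the genotype notation ('+' for the wild type copy, '-' for the deletion) and the function name are conventions chosen for this example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def cross(parent1, parent2):
    """Expected genotype fractions among progeny of a single-locus cross.

    Each parent is a two-character string of alleles, for example '+-'.
    Every pairing of one allele from each parent is treated as equally likely.
    """
    counts = Counter("".join(sorted(pair)) for pair in product(parent1, parent2))
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


progeny = cross("+-", "+-")
for genotype, fraction in sorted(progeny.items()):
    print(genotype, fraction)
```

A marked shortfall of '--' progeny relative to this expectation is the kind of observation that suggests the gene is essential for viability.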
Another complication with attempting to determine the role of a eukaryotic gene by deleting it is the existence of gene families in which a specific biochemical function is provided by several related members. Hence, correlating a phenotype with a mutation introduced into one member may require inactivation of all other members of the family. A further complication with eukaryotic organisms is that a product formed by an enzyme of interest in one biochemical pathway may be synthesized via an alternate pathway that involves a different set of proteins. In these circumstances, deletion of the gene does not yield a measurable phenotype.
In the event that deletion of a gene is not possible, an alternate approach to characterize the function of an unknown gene is by RNA interference (reviewed in Carthew and Sontheimer, 2009; Fischer, 2010; Krol et al., 2010). This strategy exploits a natural process that acts to repress the expression of genes during development, or as cells progress through the cell cycle (Fire et al., 1998; Ketting et al., 1999; Tabara et al., 1999). Small RNA molecules named microRNA (miRNA) and small interfering RNA (siRNA) become incorporated into a large complex called the RNA-induced silencing complex (RISC), which reduces the expression of target genes by facilitating the annealing of the RNA with the complementary sequence in a messenger RNA (Liu et al., 2003). The duplex RNA is recognized by a component of RISC, followed by degradation of the messenger RNA to block its expression. The RNA interference pathway has been adapted as a method to reduce or “knock down” the expression of a specific gene in order to explore its physiological function. Compared with other genetic methods that examine the effect of a specific amino acid substitution on a particular activity of a multifunctional protein, the knockout and knockdown approaches are not as refined in that they measure the physiological effect of either the reduced function, or the loss of function of the entire protein.
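The annealing step rests on base complementarity: a small RNA pairs with the site in the messenger RNA that matches its reverse complement. The sketch below locates such sites under the simplifying assumption of perfect complementarity along the full length of the small RNA. The function names and the short toy sequences are hypothetical, and real small RNAs are longer and can tolerate mismatches.

```python
# Watson-Crick complements for RNA bases.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement_rna(rna):
    """Return the reverse complement of an RNA sequence (5' to 3')."""
    return rna.translate(COMPLEMENT)[::-1]


def find_target_sites(mrna, guide):
    """0-based start positions in mrna that are perfectly complementary to guide."""
    site = reverse_complement_rna(guide)
    positions = []
    start = mrna.find(site)
    while start != -1:
        positions.append(start)
        start = mrna.find(site, start + 1)
    return positions


# Hypothetical toy sequences, much shorter than real small RNAs.
toy_mrna = "AUGGCUAGCUAGGAUCCGUAA"
toy_guide = "UCCUAG"
print(find_target_sites(toy_mrna, toy_guide))
```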
4. E. coli as a model organism for structure-function studies
Escherichia coli is a rod-shaped bacterium (0.5 micron × 2 microns in the nongrowing state) that harbors a 4.4 × 10^6 base pair genome encoding more than 4,000 genes. By transposon-based insertional mutagenesis and independently by systematic deletion of each open reading frame, these genes have been separated into those that are essential for viability, and those that are considered nonessential (Baba et al., 2006). Of the total, about 300 genes are of undetermined function, including 37 genes that are essential. BLAST analysis indicates that some of the genes of unknown function are conserved among bacteria, suggesting their functional importance.
In comparison, many of the genes of known function have been studied extensively. Among these are the genes required for duplication of the bacterial chromosome, including a subset that acts at the initiation stage of DNA replication. The following section describes a specific example that focuses on DnaC protein. Studies on this protein take advantage of bioinformatics in combination with its X-ray crystallographic structure, molecular genetic analysis, and the biochemical characterization of specific mutant DnaC proteins to obtain new insight into its role in DNA replication.
5. Molecular analysis of E. coli DnaC, an essential protein involved in the initiation of DNA replication, and replication fork restart
DNA replication is the basis for life. Occurring only once per cell cycle, DNA replication must be tightly coordinated with other major cellular processes required for cell growth so that each progeny cell receives an accurate copy of the genome at cell division (reviewed in DePamphilis and Bell, 2010). Improper coordination of DNA replication with cell growth leads to aberrant cell division that causes cell death in severe cases. In addition, the failure to control the initiation process leads to excessive initiations, followed by the production of double strand breaks that apparently arise due to head-to-tail fork collisions. In eukaryotes, if the broken DNA is not repaired, aneuploidy and chromosome fusions appear, which can lead to malignant growth.
In bacteria, chromosomal DNA replication starts at a specific locus called oriC (Figure 4). Recent reviews describe the independent mechanisms that control the frequency of initiation from this site (Nielsen and Lobner-Olesen, 2008; Katayama et al., 2010). In Escherichia coli, the minimal oriC sequence of 245 base pairs contains DNA-binding sites for many different proteins that either act directly in DNA replication, or modulate the frequency of this process (reviewed in Leonard and Grimwade, 2009). One of them is DnaA, which is the initiator of DNA replication, and has been placed in one of the clades of the AAA+ superfamily via bioinformatics analysis (Koonin, 1992; Erzberger and Berger, 2006). DnaA binds to a consensus 9 base pair sequence known as the DnaA box. There are five DnaA boxes individually named R1 through R5 within oriC that are similar in sequence and are recognized by DnaA (Leonard and Grimwade, 2009). In addition to these sites, DnaA complexed to ATP specifically recognizes three I-sites and τ-sites in oriC, which leads to the unwinding of three AT-rich 13-mer repeats located in the left half of oriC. Binding sites are also present for IHF protein (integration host factor) and FIS protein (factor for inversion stimulation). As these proteins induce bends in DNA, their apparent ability to modulate the binding of DnaA to the respective sites in oriC may involve DNA bending. Additionally, oriC carries 11 GATC sequences recognized by DNA adenine methyltransferase, and sites recognized by IciA, Rob, and SeqA. The influence of IHF, FIS, IciA, Rob and SeqA proteins on the initiation process is described in more detail in a review (Leonard and Grimwade, 2005).
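Locating candidate DnaA boxes and GATC sites in a sequence is a simple pattern search, sketched below. The consensus TTATNCACA (N is any base) is the commonly cited DnaA box consensus and is supplied here as an assumption, since the text does not state it. The function names, the mismatch tolerance, and the toy sequence are likewise hypothetical, and the toy sequence is not the oriC sequence.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(dna):
    """Reverse complement of a DNA sequence."""
    return dna.translate(COMPLEMENT)[::-1]


def mismatches(window, consensus):
    """Count positions that differ from the consensus; 'N' matches any base."""
    return sum(1 for w, c in zip(window, consensus) if c != "N" and w != c)


def scan(seq, consensus="TTATNCACA", max_mismatch=1):
    """Find consensus-like sites on both strands.

    Returns sorted (position, strand, site, mismatch_count) tuples, where
    position is the 0-based left end of the site on the given (top) strand.
    """
    k = len(consensus)
    hits = []
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for i in range(len(s) - k + 1):
            mm = mismatches(s[i:i + k], consensus)
            if mm <= max_mismatch:
                pos = i if strand == "+" else len(seq) - i - k
                hits.append((pos, strand, s[i:i + k], mm))
    return sorted(hits)


# Hypothetical toy sequence with one site on each strand.
toy_origin = "GGATCCTTATCCACAGGGATCTGTGGATAACC"
exact_hits = scan(toy_origin, max_mismatch=0)
gatc_count = toy_origin.count("GATC")
print(exact_hits, gatc_count)
```

Allowing one or more mismatches, as the default does, is a crude stand-in for the observation that the DnaA boxes in oriC are similar but not identical in sequence.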
At the initiation stage of DNA replication, the first step requires the binding of DnaA molecules, each complexed to ATP, to the five DnaA boxes, I-sites and τ-sites of oriC. After binding, DnaA unwinds the duplex DNA in the AT-rich region to form an intermediate named the open complex. HU or IHF stimulates the formation of the open complex. In the next step, the replicative helicase named DnaB becomes stably bound to the separated strands of the open complex to form an intermediate named the prepriming complex. At this juncture, DnaC must be complexed to DnaB for a single DnaB hexamer to load onto each of the separated strands. DnaC protein must then dissociate from the complex in order for DnaB to be active as a helicase. Following its loading, DnaB enlarges the unwound region of oriC, and then interacts with DnaG primase (Tougu and Marians, 1996). This interaction between DnaB and DnaG primase, which synthesizes primer RNAs that are extended by DNA polymerase III holoenzyme during semi-conservative DNA replication, marks the transition between the initiation and elongation stages of DNA replication (McHenry, 2003; Corn and Berger, 2006; Langston et al., 2009). Replication fork movement that is supported by DnaB helicase and assisted by a second helicase named Rep proceeds bidirectionally around the chromosome until it reaches the terminus region (Guy et al., 2009). The two progeny DNAs then segregate near opposite poles of the cell before septum formation and cell division.
DnaC protein (27 kDa) is essential for cell viability because it is required during the initiation stage of DNA replication (reviewed in Kornberg and Baker, 1992; Davey and O'Donnell, 2003). DnaC is also required for DNA replication of the single stranded DNA of phage ϕX174, and for many plasmids (e.g. pSC101, P1, R1). DnaC additionally acts to restart collapsed replication forks that appear when a replication fork encounters a nick, gap, double-stranded break, or modified bases in the parental DNA (Sandler et al., 1996). This process of restarting a replication fork involves assembly of the replication restart primosome that contains PriA, PriB, PriC, DnaT, DnaB, DnaC, and Rep protein (Sandler, 2005; Gabbai and Marians, 2010). The major role of DnaC at oriC, at the replication origins of the plasmids and bacteriophage described above, or in restarting collapsed replication forks is to form a complex with DnaB, which is required to escort the helicase onto the DNA, and then to depart. Since the discovery of the dnaC gene over 40 years ago (Carl, 1970), its ongoing study by various laboratories using a variety of approaches continues to reveal new aspects of the molecular mechanisms of DnaC in DNA replication.
Biochemical analysis combined with the X-ray crystallographic structure of the majority of Aquifex aeolicus DnaC (residues 43 to the C-terminal residue at position 235) reveals that DnaC protein consists of a smaller N-terminal domain that is responsible for binding to the C-terminal face of DnaB helicase, and a larger ATP-binding region of 190 amino acids (Figure 2; Ludlam et al., 2001; Galletto et al., 2003; Mott et al., 2008). Sequence comparison of homologues of the dnaC gene classifies DnaC as a member of the AAA+ family of ATPases (Koonin, 1992; Davey et al., 2002; Mott et al., 2008). However, unlike other AAA+ proteins, DnaC contains two additional α helices named the ISM motif (Initiator/loader–Specific Motif) that direct the oligomerization of this protein into a right-handed helical filament (Mott et al., 2008). In contrast, the majority of AAA+ proteins, which lack these α helices, assemble into a closed ring. Phylogenetic analysis of the AAA+ domain reveals that DnaC is most closely related to DnaA, suggesting that both proteins arose from a common ancestor (Koonin, 1992). In support, the X-ray crystallographic structures of the ATPase region of DnaA and DnaC are very similar (Erzberger et al., 2006; Mott et al., 2008).
For DnaC, ATP increases its affinity for single-stranded DNA, which stimulates its ATPase activity (Davey et al., 2002; Biswas et al., 2004). Other results suggest that ATP stabilizes the interaction of DnaC with DnaB in the DnaB-DnaC complex (Wahle et al., 1989; Allen and Kornberg, 1991), which contrasts with studies supporting the conclusion that ATP is not necessary for DnaC to form a stable complex with DnaB (Davey et al., 2002; Galletto et al., 2003; Biswas and Biswas-Fiss, 2006). As mutant DnaC proteins bearing amino acid substitutions in the Walker A box are both defective in ATP binding and apparently fail to interact with DnaB, the consequence is that these mutants cannot escort DnaB to oriC (Ludlam et al., 2001; Davey et al., 2002). Hence, despite the ability of DnaB by itself to bind to single-stranded DNA in vitro (LeBowitz and McMacken, 1986), DnaC is essential for DnaB to become stably bound to the unwound region of oriC (Kobori and Kornberg, 1982; Ludlam et al., 2001). The observation that DnaC complexed to ATP interacts with DnaA raises the possibility that both proteins act jointly in helicase loading (Mott et al., 2008). Together, these observations indicate that the ability of DnaC to bind to ATP is essential for its function in DNA replication, but the paradox about the role of ATP binding and its hydrolysis in the activity of DnaC, and the mechanism that leads to the dissociation of DnaC from DnaB, have been long-standing issues.
As described above, one of the characteristics of AAA+ proteins is the presence of a conserved motif named Box VII, which carries a conserved arginine called the “arginine finger”. Structural studies of other AAA+ proteins have led to the proposal that this arginine interacts with the γ phosphate of ATP to promote and coordinate ATP hydrolysis with a conformational change. Recent experiments were performed to examine the role of the arginine finger of DnaC and to attempt to clarify how ATP binding and its hydrolysis by DnaC are involved in the process of initiation of DNA replication (Makowska-Grzyska and Kaguni, 2010). Part of this study relied on an E. coli mutant lacking the chromosomal copy of the dnaC gene (Hupert-Kocurek et al., 2007). Because the dnaC gene is essential, this deficiency of the host strain can be complemented by a plasmid encoding the dnaC gene that depends on IPTG (isopropyl β-D-1-thiogalactopyranoside) for plasmid maintenance (Figure 5). If another plasmid is introduced into the null dnaC mutant, it maintains viability of the strain in the absence of IPTG only if it carries a functional dnaC allele. In contrast, if the second plasmid carries an inactivating mutation in dnaC, the host strain cannot survive without IPTG. This plasmid exchange method showed that an alanine substitution for the arginine finger residue inactivated DnaC (Makowska-Grzyska and Kaguni, 2010). Biochemical experiments performed in parallel showed that this conserved arginine plays a role in the signal transduction process that involves ATP hydrolysis by DnaC that then leads to the release of DnaC from DnaB. Finally, the interaction of primase with DnaB that is coupled with primer formation is also apparently necessary for DnaC to dissociate from DnaB.
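The logic of the plasmid exchange method reduces to a small truth table, sketched below. The function and parameter names are hypothetical, and the model assumes, beyond what the text states, that the second plasmid is maintained whether or not IPTG is present and that a single functional source of DnaC is sufficient for viability.

```python
def is_viable(iptg_present, test_allele_functional, helper_has_dnaC=True):
    """Predict viability of a strain lacking the chromosomal dnaC gene.

    The helper plasmid is maintained only when IPTG is present. The second
    (test) plasmid is ASSUMED to be maintained regardless of IPTG.
    """
    helper_supplies_dnaC = helper_has_dnaC and iptg_present
    return helper_supplies_dnaC or test_allele_functional


# Enumerate the four combinations of IPTG and test allele function.
outcomes = {
    (iptg, functional): is_viable(iptg, functional)
    for iptg in (True, False)
    for functional in (True, False)
}
for (iptg, functional), alive in outcomes.items():
    print(f"IPTG={iptg!s:5} functional test allele={functional!s:5} viable={alive}")
```

Only the combination without IPTG and with a nonfunctional test allele is predicted to be inviable, which is the outcome reported for the alanine substitution of the arginine finger residue.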
In summary, the combination of various experimental approaches in the study of DnaC has led to experiments that expand our understanding of the role of ATP binding and its hydrolysis by DnaC during the initiation of DNA replication. Evidence suggests that ATP hydrolysis by DnaC, which leads to the dissociation of DnaC from DnaB helicase, is coupled with primer formation that requires an interaction between DnaG primase and DnaB. Hence, these critical steps are involved in the transition from the process of initiation to the elongation phase of DNA replication in E. coli.
This example of the molecular mechanism of DnaC protein is a focused study of one protein and its interaction with other required proteins during the process of initiation of DNA replication. One may consider this a form of vertical thinking. It contrasts with bioinformatics approaches that yield large sets of data for proteins based on the DNA sequences of genomes, and with microarray approaches that, for example, survey the expression of genes and their regulation at the genome level under different conditions, or identify interacting partners for a specific protein. The vast wealth of data from these global approaches provides a different perspective on understanding the functions of sets of genes or proteins and how they act in a network of biochemical pathways of the cell.
We thank members of our labs for discussions on the content and organization of this chapter. This work was supported by grant GM090063 from the National Institutes of Health, and by the Michigan Agricultural Station to JMK.