Bioinformatics has its origins in the development of DNA sequencing methods by Alan Maxam and Walter Gilbert (Maxam and Gilbert, 1977), and by Frederick Sanger and coworkers (Sanger et al., 1977). By these entirely different approaches, the first genomes determined at the nucleotide sequence level were those of bacteriophage ϕX174 and the recombinant plasmid pBR322, composed of about 5,400 (Sanger et al., 1977) and 4,400 base pairs (Sutcliffe, 1979), respectively. In contrast, two articles that appeared in February 2001 reported the preliminary DNA sequence of the human genome, which corresponds to 3 billion nucleotides of DNA sequence information (Lander et al., 2001; Venter et al., 2001). Only two years later, the GenBank sequence database contained more than 29.3 billion nucleotide bases in more than 23 million sequences. With the development of new technologies, experts predict that the cost to sequence an individual’s DNA will fall to about $1000. This reduction in cost suggests that efforts in the area of comparative genomics will increase substantially, leading to an enormous database that vastly exceeds the existing one.
By way of comparative genomics approaches, computational methods have led to the identification of homologous genes shared among species, and their classification into superfamilies based on amino acid sequence similarity. On the basis of their evolutionary relatedness, superfamily members have been clustered into clades. In addition, high throughput sequencing of small RNAs and bioinformatics analyses have contributed to the identification of regions between genes that can encode small RNAs (siRNA, microRNA, and long noncoding RNA), which act during the development of an organism to modulate gene expression at the post-transcriptional level (Fire et al., 1998; Hamilton and Baulcombe, 1999; reviewed in Elbashir et al., 2001; Ghildiyal and Zamore, 2009; Christensen et al., 2010). An emerging area is functional genomics, whereby gene function is deduced using large-scale methods by identifying the involvement of specific genes in metabolic pathways. More recently, phenotype microarray methods have been used to correlate the functions of genes of microbes with cell phenotypes under a variety of growth conditions (Bochner, 2009).
These methods contrast with the traditional approach of mapping a gene via the phenotype of a mutation, and deducing the function of the gene product based on its biochemical analysis in concert with physiological studies. Such studies have been performed to confirm the functional importance of conserved residues shared by superfamily members, and also to determine the role of specific residues for a given protein. In comparison, comparative genomics methods are unable to distinguish whether a nonconserved amino acid among superfamily members is functionally important, or simply reflects sequence divergence due to the absence of selection during evolution. Without functional information, it is not possible to determine whether a nonconserved amino acid is important.
2. Bioinformatics analysis of AAA+ proteins
On the basis of bioinformatics analysis, P-loop nucleotide hydrolases compose a very large group of proteins that use an amino acid motif named the phosphate binding loop (P-loop) to hydrolyze the phosphate ester bonds of nucleotides. A positively charged group in the side chain of an amino acid (often lysine) in the P-loop promotes nucleotide hydrolysis by interacting with the phosphate of the bound nucleotide. Additional bioinformatics analysis of this group of proteins led to a category of nucleotidases containing the Walker A and B motifs, as well as additional motifs shared by the AAA (ATPases Associated with diverse cellular Activities) superfamily (Beyer, 1997; Swaffield and Purugganan, 1997). These diverse activities include protein unfolding and degradation, vesicle transport and membrane fusion, transcription and DNA replication. The additional motifs of the AAA superfamily differentiate its members from the larger set of P-loop nucleotidases. Neuwald
Members of the superfamily of AAA+ ATPases carry a nucleotide-binding pocket called the AAA+ domain that ranges from 200 to 250 amino acids, which is formed by an αβα-type Rossmann fold followed by several α helices (Figure 1) (Lupas and Martin, 2002; Iyer et al., 2004; Hanson and Whiteheart, 2005). Such proteins often assemble into ring-shaped or helical oligomers (Davey et al., 2002; Iyer et al., 2004; Erzberger and Berger, 2006). Using the nomenclature of Iyer
AAA+ proteins also share conserved motifs named the Sensor 1, Box VII, and Sensor 2 motifs that coordinate ATP hydrolysis with a change in conformation (Figure 1) (Lupas and Martin, 2002; Iyer et al., 2004; Hanson and Whiteheart, 2005). Relative to the primary amino acid sequence, these motifs are on the C-terminal side of the Walker B motif. The Sensor 1 motif contains a polar amino acid at the end of the β4 strand. On the basis of the X-ray crystal structure of N-ethylmaleimide-sensitive factor, an ATPase involved in intracellular vesicle fusion (Beyer, 1997; Swaffield and Purugganan, 1997), this amino acid together with the acidic residue in the Walker B motif interacts with and aligns the activated water molecule during nucleotide hydrolysis. The Box VII motif, which is also called the SRH (Second Region of Homology) motif, contains an arginine named the arginine finger for its analogous function with the corresponding arginine of GTPase activator proteins, which interacts with GTP bound to a small G protein partner to promote GTP hydrolysis. The crystal structures of several AAA+ proteins have shown that the Box VII motif in an individual molecule is located some distance away from the nucleotide binding pocket. In AAA+ proteins that assemble into ring-shaped or helical oligomers, the Box VII motif of one protomer directs an arginine residue responsible for interaction with the γ phosphate of ATP toward the ATP binding pocket of the neighboring molecule. It is proposed that this interaction or lack thereof coordinates ATP hydrolysis with a conformational change. The Sensor 2 motif, which resides in one of the α helices that follow the Rossmann fold, also contains a conserved arginine. For proteins whose structures contain the bound nucleoside triphosphate or a nucleotide analogue, this amino acid interacts with the γ phosphate of the nucleotide.
As reviewed by Ogura (Ogura et al., 2004), this residue is involved in ATP binding or its hydrolysis in some but not all AAA+ proteins. Like the arginine finger residue, this arginine is thought to coordinate a change in protein conformation with nucleotide hydrolysis.
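As an illustration of how a sequence motif such as the Walker A box (P-loop) can be located computationally, the following minimal Python sketch scans a protein sequence with a regular expression. The consensus used here, G-x(4)-G-K-[ST], is a commonly cited form of the Walker A motif and is an assumption of this example rather than a pattern taken from this chapter; the toy sequence is hypothetical and does not correspond to any real protein.

```python
import re

# Assumed Walker A (P-loop) consensus: G-x(4)-G-K-[ST].
# The lookahead wrapper lets overlapping candidate sites be reported.
WALKER_A = re.compile(r"(?=(G.{4}GK[ST]))")

def find_walker_a(protein_seq):
    """Return (1-based start, matched text) for each Walker A-like site."""
    seq = protein_seq.upper()
    return [(m.start() + 1, m.group(1)) for m in WALKER_A.finditer(seq)]

# Hypothetical toy sequence used only to exercise the function.
toy_sequence = "MSTLAGPPGVGKTHLAIAE"
print(find_walker_a(toy_sequence))
```

A match of this kind only flags a candidate site. As the text emphasizes, whether the lysine in such a site actually supports nucleotide binding or hydrolysis has to be established by structural and biochemical analysis.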
Because this chapter focuses on members of the DnaA/CDC6/ORC or initiator clade, the following summarizes properties of this clade and not others. Like the clamp loader clade, proteins in the initiator clade as represented by DnaA and DnaC have a structure resembling an open spiral on the basis of X-ray crystallography (Erzberger et al., 2006; Mott et al., 2008). In comparison, oligomeric proteins in the remaining clades form closed rings. A characteristic feature of proteins in the initiator clade is the presence of two α helices between the β2 and β3 strands (Figure 1). Compared with the function of DnaA in the initiation of
Bioinformatics analysis of DnaC suggests that this protein is a paralog of DnaA, arising by gene duplication and then diverging with time to perform a separate role from DnaA during the initiation of DNA replication (Koonin, 1992). This notion leads to the question of what specific amino acids are responsible for the different functions of DnaA and DnaC despite the shared presence of the AAA+ amino acid sequence motifs. Presumably, specific amino acids that are not conserved between these two proteins have critical roles in determining their different functions, but how are these residues identified and distinguished from those that are not functionally important? In addition, some amino acids that are conserved among homologous DnaC proteins, which were identified by multiple sequence alignment of twenty-eight homologues (Figure 2), are presumably responsible for the unique activities of DnaC, but what are these unique activities? These issues underscore the limitation of deducing the biological function of a protein by relying only on bioinformatics analysis.
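To make concrete what "conserved among homologues" means in a multiple sequence alignment, the sketch below computes, for each alignment column, the most common residue and the fraction of sequences that carry it. It is a minimal illustration under stated assumptions: the sequences are already aligned to equal length, gaps are written as "-", and the three toy sequences are hypothetical rather than taken from the alignment in Figure 2.

```python
from collections import Counter

def column_conservation(aligned_seqs):
    """For each column, return (most common residue, fraction of sequences with it)."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("aligned sequences must have equal length")
    result = []
    for col in zip(*aligned_seqs):
        residues = [r for r in col if r != "-"]  # gaps never count as conserved
        if not residues:
            result.append(("-", 0.0))
            continue
        residue, count = Counter(residues).most_common(1)[0]
        result.append((residue, count / len(col)))
    return result

def invariant_positions(aligned_seqs):
    """Return 1-based columns in which every sequence has the same residue."""
    scores = column_conservation(aligned_seqs)
    return [i + 1 for i, (_, frac) in enumerate(scores) if frac == 1.0]

# Hypothetical toy alignment of three short sequences.
toy_alignment = ["GKTHLA", "GKSHLA", "GKTH-A"]
print(invariant_positions(toy_alignment))
```

Such a calculation identifies candidate residues only. It cannot say which activity an invariant residue supports, nor whether a variable position is functionally important, which is the limitation discussed above.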
3. Reverse genetics as an approach to identify the function of an unknown gene
If alignment of the amino acid sequence encoded by a particular gene, by any of various methods, reveals no homology to a gene of known function, the function of that gene remains unknown. In such cases, the general approach is to employ reverse genetics to attempt to correlate a phenotype with a mutation in the gene. By way of comparison, forward genetics begins with a phenotype caused by a specific mutation at an unknown site in the genome. The approximate position of the gene can be determined by classical genetic methods that involve its linkage to another mutation that gives rise to a separate phenotype. Refined linkage mapping can localize the gene of interest, followed by PCR (polymerase chain reaction) amplification of the region and DNA sequence analysis to determine the nature of the mutation. As a recent development, whole genome sequencing has been performed to map mutations, dispensing with the classical method of genetic linkage mapping (Lupski et al., 2010; Ng and Kirkness, 2010). The DNA sequence obtained may reveal that the gene and the corresponding gene product have been characterized in the same or a different organism, and disclose its physiological function.
In a reverse genetics approach with a haploid organism, the standard strategy is to inactivate the gene with the hope that a phenotype can be measured. Inactivation can be achieved either by deleting the gene or by insertional mutagenesis, usually with a transposon. As examples, transposon mutagenesis has been performed with numerous microbial species, and with
replacement of the chromosomal copy of the gene with the drug resistance cassette, after which the excised copy of the chromosomal gene is lost. In both
With a multicellular organism, a similar strategy that relies on homologous recombination is used to delete a gene. The deletion is usually introduced into an embryonic stem cell so that its effect can be measured in the whole organism. Many eukaryotic organisms have two complete sets of chromosomes. Because the process of homologous recombination introduces the deletion mutation into only one of the two homologous chromosomes, yielding a heterozygote, the presence of the wild type copy on the other homologue may conceal the biological effect of the deletion. Thus, the ideal objective is to delete both copies of a gene in order to measure the associated phenotype. To attempt to obtain an organism in which both copies of a gene have been “knocked out,” the standard strategy is to mate heterozygous individuals. By Mendelian genetics, one-fourth of the progeny should carry the deletion in both copies of the gene. A drawback of deleting a gene is that the gene may be essential for viability, as is suggested when a homozygous knockout organism cannot be obtained. Another pitfall is that it may not be possible to construct a heterozygous knockout because the single wild type copy is insufficient to maintain viability. In either case, no other hint of gene function is obtained except for the requirement for life.
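The one-fourth expectation from mating two heterozygotes can be checked by enumerating the allele combinations, as in a Punnett square. The short Python sketch below does this; the allele labels "+" (wild type) and "d" (deletion) are arbitrary names chosen for this example, and the calculation assumes that each parent transmits either allele with equal probability and that all genotypes are equally viable.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def cross(parent1, parent2):
    """Return expected genotype fractions among progeny of a single-gene cross.

    Each parent is a two-character string of alleles, and each allele is
    assumed to be transmitted with equal probability.
    """
    counts = Counter("".join(sorted(pair)) for pair in product(parent1, parent2))
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}

# "+" = wild type allele, "d" = deletion allele (labels are hypothetical).
progeny = cross("+d", "+d")
for genotype, fraction in sorted(progeny.items()):
    print(genotype, fraction)
```

If the gene is essential, the homozygous deletion class is absent among the viable progeny, which is the observation the text describes as suggesting that the gene is required for viability.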
Another complication with attempting to determine the role of a eukaryotic gene by deleting it is the existence of gene families where a specific biochemical function is provided by allelic variants. Hence, to correlate a phenotype by introducing a mutation into a specific allelic variant requires inactivation of all other members of the family. A further complication with eukaryotic organisms is that a product formed by an enzyme of interest in one biochemical pathway may be synthesized via an alternate pathway that involves a different set of proteins. In these circumstances, deletion of the gene does not yield a measurable phenotype.
In the event that deletion of a gene is not possible, an alternate approach to characterize the function of an unknown gene is by RNA interference (reviewed in Carthew and Sontheimer, 2009; Fischer, 2010; Krol et al., 2010). This strategy exploits a natural process that acts to repress the expression of genes during development, or as cells progress through the cell cycle (Fire et al., 1998; Ketting et al., 1999; Tabara et al., 1999). Small RNA molecules named microRNA (miRNA) and small interfering RNA (siRNA) become incorporated into a large complex called the RNA-induced silencing complex (RISC), which reduces the expression of target genes by facilitating the annealing of the RNA with the complementary sequence in a messenger RNA (Liu et al., 2003). The duplex RNA is recognized by a component of RISC, followed by degradation of the messenger RNA to block its expression. The RNA interference pathway has been adapted as a method to reduce or “knock down” the expression of a specific gene in order to explore its physiological function. Compared with other genetic methods that examine the effect of a specific amino acid substitution on a particular activity of a multifunctional protein, the knockout and knockdown approaches are not as refined in that they measure the physiological effect of either the reduced function, or the loss of function, of the entire protein.
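The annealing step described above depends on base complementarity between the small RNA and its target, which can be sketched in a few lines of Python. The example below searches a messenger RNA for sites that are perfectly complementary to a guide RNA. It is a deliberate simplification: perfect Watson-Crick pairing is assumed, whereas natural targeting can tolerate mismatches, and both sequences are hypothetical toy strings.

```python
# Watson-Crick complement table for RNA.
COMPLEMENT = str.maketrans("ACGU", "UGCA")

def reverse_complement(rna):
    """Return the reverse complement of an RNA sequence."""
    return rna.upper().translate(COMPLEMENT)[::-1]

def find_target_sites(guide, mrna):
    """Return 1-based start positions in mrna that pair perfectly with guide."""
    target = reverse_complement(guide)
    mrna = mrna.upper()
    sites = []
    pos = mrna.find(target)
    while pos != -1:
        sites.append(pos + 1)
        pos = mrna.find(target, pos + 1)  # keep searching, allowing overlaps
    return sites

# Hypothetical toy sequences, far shorter than a real guide or transcript.
toy_mrna = "AUGGCUAGCUUAGGCAUCGA"
toy_guide = "GCCUAAGC"
print(find_target_sites(toy_guide, toy_mrna))
```

Designing an actual knockdown experiment involves further considerations, such as off-target matches elsewhere in the transcriptome, that this sketch does not address.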
4. E. coli as a model organism for structure-function studies
In comparison, many of the genes of known function have been studied extensively. Among these are the genes required for duplication of the bacterial chromosome, including a subset that acts at the initiation stage of DNA replication. The following section describes a specific example that focuses on DnaC protein. Studies on this protein take advantage of bioinformatics in combination with its X-ray crystallographic structure, molecular genetic analysis, and the biochemical characterization of specific mutant DnaC proteins to obtain new insight into its role in DNA replication.
5. Molecular analysis of E. coli DnaC, an essential protein involved in the initiation of DNA replication, and replication fork restart
DNA replication is the basis for life. Occurring only once per cell cycle, DNA replication must be tightly coordinated with other major cellular processes required for cell growth so that each progeny cell receives an accurate copy of the genome at cell division (reviewed in DePamphilis and Bell, 2010). Improper coordination of DNA replication with cell growth leads to aberrant cell division that causes cell death in severe cases. In addition, the failure to control the initiation process leads to excessive initiations, followed by the production of double strand breaks that apparently arise due to head-to-tail fork collisions. In eukaryotes, if the broken DNA is not repaired, aneuploidy and chromosome fusions appear, which can lead to malignant growth.
In bacteria, chromosomal DNA replication starts at a specific locus called
Recent reviews describe the independent mechanisms that control the frequency of initiation from this site (Nielsen and Lobner-Olesen, 2008; Katayama et al., 2010). In
At the initiation stage of DNA replication, the first step requires the binding of DnaA molecules, each complexed to ATP, to the five DnaA boxes, I- and τ- sites of
DnaC protein (27 kDa) is essential for cell viability because it is required during the initiation stage of DNA replication (reviewed in Kornberg and Baker, 1992; Davey and O'Donnell, 2003). DnaC is also required for DNA replication of the single stranded DNA of phage ϕX174, and for many plasmids (e.g. pSC101, P1, R1). DnaC additionally acts to resurrect collapsed replication forks that appear when a replication fork encounters a nick, gap, double-stranded break, or modified bases in the parental DNA (Sandler et al., 1996). This process of restarting a replication fork involves assembly of the replication restart primosome that contains PriA, PriB, PriC, DnaT, DnaB, DnaC, and Rep protein (Sandler, 2005; Gabbai and Marians, 2010). The major roles of DnaC at
Biochemical analysis combined with the X-ray crystallographic structure of the majority of
For DnaC, ATP increases its affinity for single-stranded DNA, which stimulates its ATPase activity (Davey et al., 2002; Biswas et al., 2004). Other results suggest that ATP stabilizes the interaction of DnaC with DnaB in the DnaB-DnaC complex (Wahle et al., 1989; Allen and Kornberg, 1991), in contrast with studies supporting the conclusion that ATP is not necessary for DnaC to form a stable complex with DnaB (Davey et al., 2002; Galletto et al., 2003; Biswas and Biswas-Fiss, 2006). As mutant DnaC proteins bearing amino acid substitutions in the Walker A box are both defective in ATP binding and apparently fail to interact with DnaB, the consequence is that these mutants cannot escort DnaB to
As described above, one of the characteristics of AAA+ proteins is the presence of a conserved motif named Box VII, which carries a conserved arginine called the “arginine finger”. Structural studies of other AAA+ proteins have led to the proposal that this arginine interacts with the γ phosphate of ATP to promote and coordinate ATP hydrolysis with a conformational change. Recent experiments were performed to examine the role of the arginine finger of DnaC and to attempt to clarify how ATP binding and its hydrolysis by DnaC are involved in the process of initiation of DNA replication (Makowska-Grzyska and Kaguni, 2010). Part of this study relied on an
In summary, the combination of various experimental approaches in the study of DnaC has led to insightful experiments that expand our understanding of the role of ATP binding and its hydrolysis by DnaC during the initiation of DNA replication. Evidence suggests that ATP hydrolysis by DnaC, which leads to the dissociation of DnaC from DnaB helicase, is coupled with primer formation that requires an interaction between DnaG primase and DnaB. Hence, these critical steps are involved in the transition from the process of initiation to the elongation phase of DNA replication in
This example of the molecular mechanism of DnaC protein is a focused study of one protein and its interaction with other required proteins during the process of initiation of DNA replication. One may consider this a form of vertical thinking. It contrasts with bioinformatics approaches that yield large sets of data for proteins based on the DNA sequences of genomes, and with microarray approaches that, for example, survey the expression of genes and their regulation at the genome level under different conditions, or identify interacting partners for a specific protein. The vast wealth of data from these global approaches provides a different perspective on understanding the functions of sets of genes or proteins and how they act in a network of biochemical pathways of the cell.
We thank members of our labs for discussions on the content and organization of this chapter. This work was supported by grant GM090063 from the National Institutes of Health, and by the Michigan Agricultural Station to JMK.