In 1950, Chargaff found experimentally that, when the two strands of DNA are analyzed together, G and C (and likewise T and A) occur with equal abundance (Chargaff, 1950). Three years later, Watson and Crick (1953) published the double-helix model of DNA, whose base-pairing rule explains these equal frequencies. This is called Chargaff's first rule, or parity rule 1. Surprisingly, Lin and Chargaff (1967) observed that complementary nucleotides also occur at approximately equal frequencies within each single DNA strand. This is called Chargaff's second rule, or parity rule 2. Parity rule 2 is explained theoretically as follows. When mutation and selection are symmetric with respect to the two strands of DNA, parity rule 1 requires the following six pairs of substitution rates to be equal: rGC=rCG, rTA=rAT, rGA=rCT, rAG=rTC, rCA=rGT, rAC=rTG, where rGC denotes the substitution rate of G to C in a specific strand, and so on (Lobry, 1995). Given these six pairs of equal substitution rates, it can be formally derived that complementary nucleotides within each strand have the same frequencies at equilibrium. Parity rule 2 therefore holds only when there is no strand bias in mutation or selection; it is a natural consequence of parity rule 1 at the equilibrium state between the two strands. Conversely, any deviation from parity rule 2 implies a substitutional strand bias: the result of different mutation and/or repair rates, different selective pressures, or both, between the two strands of DNA (Lobry and Sueoka, 2002). In the past two decades, these deviations from intra-strand equimolarity have been extensively studied in eukaryotes (Niu et al., 2003) and their organelles (Krishnan et al., 2004), in viruses (Mrazek and Karlin, 1998), and particularly in bacteria and archaea (Necsulea and Lobry, 2007). In bacteria, the observed deviations switch sign at the origin and terminus of replication.
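The step from six pairs of equal substitution rates to equal intra-strand frequencies can be checked numerically. The following is a minimal sketch, not the formal derivation of Lobry (1995): the rate values in `BASE_RATES`, the starting frequencies, and the function names are arbitrary assumptions chosen for illustration, and the equilibrium is approached by simple Euler iteration.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

# Hypothetical rates for one member of each of the six pairs (arbitrary units).
BASE_RATES = {
    ("G", "C"): 0.20, ("T", "A"): 0.10, ("G", "A"): 0.50,
    ("A", "G"): 0.30, ("C", "A"): 0.15, ("A", "C"): 0.05,
}

def symmetric_rates(base_rates):
    """Impose r(X->Y) == r(comp(X)->comp(Y)), i.e. no strand bias."""
    rates = {}
    for (src, dst), r in base_rates.items():
        rates[(src, dst)] = r
        rates[(COMPLEMENT[src], COMPLEMENT[dst])] = r
    return rates

def equilibrium(rates, start, dt=0.05, steps=20000):
    """Iterate base frequencies under the substitution rates until they settle."""
    freq = dict(start)
    for _ in range(steps):
        new = dict(freq)
        for (src, dst), r in rates.items():
            flow = r * freq[src] * dt  # fraction of src converted to dst
            new[src] -= flow
            new[dst] += flow
        freq = new
    return freq

if __name__ == "__main__":
    # Start far from parity rule 2 to show that the symmetry restores it.
    start = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
    eq = equilibrium(symmetric_rates(BASE_RATES), start)
    print({base: round(value, 4) for base, value in eq.items()})
```

Under these assumptions the iteration ends with A equal to T and G equal to C within one strand, whatever the starting composition. Changing a single rate in the dictionary returned by `symmetric_rates` so that it no longer matches its partner generally breaks that equality, which is the sense in which a deviation from parity rule 2 signals strand bias.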
This chapter reviews strand-specific composition bias in bacterial genomes, its varying strength in different species, the underlying mechanisms, and the methods used to analyze it.
2. Strand-specific composition bias in bacterial genomes
2.1. Strand-specific substitution and composition biases
DNA replication is a semi-conservative process (Rocha, 2004). The two strands of the parental duplex are separated, and each serves as a template for the synthesis of a new partner strand. The parental duplex is thus replaced by two daughter duplexes, each consisting of one parental strand and one newly synthesized strand. Because the two parental strands are antiparallel, one daughter strand would be synthesized in the 5'->3' direction and the other would have to be copied in the 3'->5' direction. However, DNA polymerases can only catalyze synthesis in the 5'->3' direction. Thus, the strand whose overall growth is 5'->3' (the leading strand) is synthesized continuously, whereas the strand whose overall growth would be 3'->5' (the lagging strand) is synthesized discontinuously. That is to say, lagging strand replication proceeds through the synthesis of relatively short polynucleotide segments (Okazaki fragments), each made in the 5'->3' direction, that are then joined together to form a continuous strand (Rocha, 2004).
As mentioned above, the deviations from parity rule 2 observed in bacteria switch sign at the origin and terminus of replication. That is to say, the substitution bias occurs between the two replicating strands, namely the leading and lagging strands. There are two major ways of studying asymmetric substitutions: observing biased substitution rates between homologous sequences, and directly detecting composition deviations from parity rule 2 (Frank and Lobry, 1999).
Wu and Maeda (1987) used the first method to test for asymmetric substitution in certain regions of chromosomal sequences from six primates. They obtained homologous sequences of the beta-globin complex for the six primates and then calculated the substitution matrix. After comparing the substitution rates of complementary nucleotides, they obtained the first observation of strand asymmetry. The sequence comparisons even allowed them to make predictions about the positions of replication origins. But in later studies, the examination of longer sequences (Bulmer, 1991) did not show the existence of strand asymmetry. Francino et al. (1996) used the same method to investigate asymmetric substitution in the bacterium
The substitution bias could be reflected by the different occurrence frequencies of the four nucleotides between the two strands. The second method builds on the analysis of the DNA sequences for deviations from A=T and G=C. Such deviations in SV 40 were found to have a polarity switch at the origin of replication and thus were taken as evidence for asymmetric mutation in the replication process (Filipski, 1990). The strand nucleotide composition bias was then found in genomes of echinoderm and vertebrate mitochondria (Asakawa et al., 1991). Strand composition biases were observed in the genome of
2.2. Methods used to elucidate the bias and to predict replication origins
For unannotated bacterial genomes, the location of the replication origin is not available. It is therefore unknown whether a gene lies on the leading or the lagging strand, and quantitative results such as those in Table 1 cannot be obtained. In this circumstance, strand composition biases, i.e. deviations from parity rule 2, are usually studied by graphical methods. The GC skew (and/or AT skew), the cumulative GC skew and the Z curve are three such methods.
GC skews were first used to study mitochondrial strand asymmetry and were then widely applied to bacterial genomes (Lobry, 1996). The GC skew is calculated by the following equation:
$$\text{GC skew} = \frac{G - C}{G + C} \qquad (1)$$
where G and C denote the numbers of occurrences of the corresponding bases in a sequence of given length. Skew values along a long sequence are usually computed with a sliding window. The window length is fixed, and adjacent windows may partly overlap. Plotting the chromosomal position on the horizontal axis against the skew value on the vertical axis gives a line chart. In that way, a GC skew plot for
Although the window-based GC skew method is extensively used, a proper window size is hard to choose. With a small window, the plot shows many fluctuations and may not be very illustrative, whereas a large window may hide the precise coordinates of polarity switches. In many cases, therefore, no optimal window size exists. To address this point, a more convenient skew diagram was later proposed by Grigoriev (1998). He suggested calculating directly the sum of (G - C)/(G + C) over adjacent windows from an arbitrary start to a given point in the sequence. Although this method is still based on windows, the cumulative GC skew diagram tends to be smoother because it takes the form of a sum. To avoid dependence on the window size w and the chromosome length c, Grigoriev (1998) suggested multiplying the cumulative skew values by w/c. A cumulative skew diagram for
TA skew or cumulative TA skew can be calculated and plotted by replacing the symbol G with T and C with A in equation (1). Similarly, keto-amino or purine-pyrimidine skews may be obtained by making the appropriate replacements.
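The windowed and cumulative skews described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used in the cited studies: the function names and the `plus`/`minus` parameters are assumptions introduced here, windows with no countable bases are given a skew of zero by convention, and the cumulative version follows the w/c scaling attributed to Grigoriev (1998) in the text.

```python
def skew(window, plus="G", minus="C"):
    """(plus - minus) / (plus + minus) for one window; 0.0 if neither base occurs."""
    p, m = window.count(plus), window.count(minus)
    return (p - m) / (p + m) if p + m else 0.0

def windowed_skew(seq, window, step=None, plus="G", minus="C"):
    """Skew in sliding windows; step < window gives overlapping windows."""
    step = step or window
    seq = seq.upper()
    return [(start, skew(seq[start:start + window], plus, minus))
            for start in range(0, len(seq) - window + 1, step)]

def cumulative_skew(seq, window, plus="G", minus="C"):
    """Running sum of skews over adjacent windows, scaled by w/c."""
    scale = window / len(seq)
    total, profile = 0.0, []
    for start, value in windowed_skew(seq, window, window, plus, minus):
        total += value
        profile.append((start + window, total * scale))
    return profile

if __name__ == "__main__":
    toy = "GGGC" * 2 + "CCCG" * 2   # G-rich half followed by a C-rich half
    print(windowed_skew(toy, 8))    # skew changes sign between the halves
    print(cumulative_skew(toy, 8))  # rises, then falls back
    print(windowed_skew("TTTA", 4, plus="T", minus="A"))  # TA skew
```

Passing `plus="T", minus="A"` gives the TA skew, which mirrors the symbol replacement described for equation (1).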
Both GC skew and cumulative GC skew are based on sliding windows. The Z curve is a method that dispenses with the sliding window entirely. We briefly describe the Z curve method as follows. The Z curve is a three-dimensional space curve that constitutes a unique representation of a given DNA sequence, in the sense that the curve and the sequence can each be uniquely reconstructed from the other (Zhang and Zhang, 2003). Consider a DNA sequence of N bases read from the 5’-end to the 3’-end, inspected one base at a time, beginning with the first base. The number of inspection steps is denoted by n, i.e., n = 1, 2,..., N. In the
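$$
\begin{aligned}
x_n &= (A_n + G_n) - (C_n + T_n) \equiv R_n - Y_n,\\
y_n &= (A_n + C_n) - (G_n + T_n) \equiv M_n - K_n,\\
z_n &= (A_n + T_n) - (C_n + G_n) \equiv W_n - S_n,
\end{aligned}
\qquad n = 0, 1, 2, \ldots, N, \quad x_n, y_n, z_n \in [-N, N] \qquad (2)
$$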
where A0 = C0 = G0 = T0 = 0 and hence x0 = y0 = z0 = 0. The symbols R, Y, M, K, W, and S represent the puRines, pYrimidines, aMino, Keto, Weak hydrogen bonds and Strong hydrogen bonds, respectively, according to the Recommendation 1984 by the NC-IUB (Cornish-Bowden, 1984). The connection of the nodes P0 (P0 = 0), P1, P2,..., until PN one by one sequentially by straight lines is called the Z curve for the DNA sequences inspected. The Z curve defined above is a 3-D space curve, having three independent components, i.e., xn, yn and zn (Zhang and Zhang, 2003).
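The definition above translates directly into code. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the sequence is assumed to contain only A, C, G and T, because an ambiguous base would break the one-to-one correspondence between sequence and curve. The inverse function demonstrates that the sequence can be recovered from the nodes.

```python
# Step taken in (x, y, z) when each base is read, from the Z curve equations:
# x: purine vs pyrimidine, y: amino vs keto, z: weak vs strong hydrogen bonds.
STEP = {"A": (1, 1, 1), "C": (-1, 1, -1), "G": (1, -1, -1), "T": (-1, -1, 1)}
BASE_OF_STEP = {step: base for base, step in STEP.items()}

def z_curve(seq):
    """Return the nodes P0..PN of the Z curve as (x, y, z) tuples."""
    x = y = z = 0
    points = [(0, 0, 0)]
    for base in seq.upper():
        if base not in STEP:
            raise ValueError(f"unsupported base: {base!r}")
        dx, dy, dz = STEP[base]
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points

def sequence_from_z_curve(points):
    """Reconstruct the DNA sequence from consecutive Z curve nodes."""
    bases = []
    for (x0, y0, z0), (x1, y1, z1) in zip(points, points[1:]):
        bases.append(BASE_OF_STEP[(x1 - x0, y1 - y0, z1 - z0)])
    return "".join(bases)

if __name__ == "__main__":
    nodes = z_curve("ACGT")
    print(nodes)
    print(sequence_from_z_curve(nodes))
```

Because each base adds a fixed step to the running totals, the components after n bases equal (An + Gn) - (Cn + Tn), (An + Cn) - (Gn + Tn) and (An + Tn) - (Cn + Gn), as in the definition.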
When the Z curve is used to predict replication origins or to study strand composition bias, only the x and y components of the 3-D curve are involved (Guo and Yu, 2007). According to equation (2), the x component curve denotes the sum of the cumulative excess of G over C and of A over T, whereas the y component curve represents the negative of the sum of the cumulative excess of G over C and of T over A. In short, the x component denotes the cumulative excess of purines over pyrimidines and the y component the negative of the cumulative excess of keto over amino bases. As an example, the x and y component curves for the
According to analyses of bacterial genomes with experimentally determined replication origins, the skew or Z curve plots of almost all of them switch sign or polarity at the replication origin. This is the result of different nucleotide composition biases on the two replicating strands. Based on that fact, replication origins may be putatively predicted with these methods in newly sequenced bacterial genomes. Indeed, during the annotation of most sequenced prokaryotes, replication origins were identified by using one, two or all three of these methods. Theoretical prediction of replication origins is therefore one of the practical applications of the universal phenomenon of strand composition bias in bacterial genomes.
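The prediction step can be illustrated with the cumulative GC skew. The sketch below is a simplified example, not a published origin-finding tool: it assumes a single origin and terminus, assumes that the leading strand carries an excess of G over C (so that the cumulative skew of the sequenced strand reaches its minimum near the origin and its maximum near the terminus), and locates both only to the resolution of the window. The function name and the synthetic test sequence are hypothetical.

```python
def predict_origin_terminus(seq, window=1000):
    """Return (origin, terminus) as the positions of the minimum and maximum
    of the cumulative GC skew computed over adjacent windows."""
    seq = seq.upper()
    total, profile = 0.0, []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        if g + c:
            total += (g - c) / (g + c)
        profile.append((start + window, total))
    origin = min(profile, key=lambda point: point[1])[0]
    terminus = max(profile, key=lambda point: point[1])[0]
    return origin, terminus

if __name__ == "__main__":
    # Synthetic chromosome: C-rich (lagging-like) up to 1000 bp, G-rich
    # (leading-like) from 1000 to 3000 bp, then C-rich again to 4000 bp.
    lagging_like, leading_like = "CCGA", "GGCT"
    genome = lagging_like * 250 + leading_like * 500 + lagging_like * 250
    print(predict_origin_terminus(genome, window=100))
```

On real genomes the extrema are blurred by local fluctuations and rearrangements, so a prediction of this kind is a candidate region to be confirmed by other evidence rather than a definitive coordinate.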
2.3. Consistent direction and varying strengths of strand composition bias
Almost all published reports of significant strand composition bias in bacterial genomes reveal an excess of G over C in the leading strand. By contrast, an excess of C over G in the leading strand is very rarely observed. Necsulea and Lobry (2007) performed a thorough analysis of base skew in 360 sequenced bacterial genomes. In this work, they investigated the direction or sign of the bias between complementary nucleotides. Table 2 summarizes their results. Among 360 bacteria, only 33 chromosomes show no significant effect of replication. The absence of direct replication effects on base composition bias seems to be more frequent in certain bacterial families, such as
However, the strength of specific composition biases varies from genome to genome in bacteria. Rocha (2004) used a quantitative method to evaluate the strength of strand composition bias in 58 completely sequenced prokaryotes. The accuracy with which leading strand genes and proteins can be discriminated on the basis of their nucleotide compositions is used as the index of strand bias. If there is no composition bias between the two strands, the expected accuracy is about 50%. According to their results,
Prior to Rocha, the different nucleotide compositions between genes on the two replicating strands of
Among the 11 bacteria with extremely strong strand composition bias, the observations for three are from our group:
Here, Figure 4 shows the positions of the genes along the first and second major axes produced by COA on codon counts. The closeness of any two genes on the plot reflects the similarity of their codon usages. As can be seen, the first axis alone separates the genes into two clusters with little overlap. The following two facts indicate that the two groups correspond to genes on the leading and lagging strands of replication, respectively. (i) The first axis is found to correlate strongly with GC and AT skews. At the left end of the first axis, genes are rich in the nucleotides G and T, whereas the opposite is true at the right end. (ii) The coordinates of individual genes along the first axis of COA are plotted against the chromosomal locations of the corresponding genes in Figure 5. Genes on the direct strand and those on the reverse complement strand are denoted by red and blue squares, respectively. Genes on the left side of the sequenced direct strand and genes on the right side of the reverse complement strand have lower coordinate values on the first axis, whereas the opposite holds for the other genes. In fact, genes on the left side of the direct strand and those on the right side of the reverse complement strand are exactly the genes on the leading strand, whereas the others lie on the lagging strand. Therefore, it is reasonable to say that the two clusters in Figure 4 correspond to genes on the leading and lagging strands, respectively. Marking the genes located on the leading and lagging strands with different symbols in Figure 4 confirms this inference.
A Chi-square test was then performed to compare the RSCU values of genes located on the two replicating strands, and the results are listed in Table 3. RSCU (Relative Synonymous Codon Usage) is defined in Equation 3. where xij is the occurrence number of the
In Table 3, the symbol ++ indicates that the leading strand genes used the codon more frequently than the lagging strand genes, the symbol -- indicates that the lagging strand genes used the codon more frequently than the leading strand genes, and xx indicates that there is no significant difference in the usage of the codon between the strands. In total, 49 of the 59 codons differ significantly in usage between genes on the leading strand and those on the lagging strand. Among the 23 codons used more frequently in the leading strand, 19 are G-ending or T-ending; the exceptions are TTA, ACA, AGA and GCA. Among the 26 codons used more frequently in the lagging strand, 16 are C-ending and 8 are A-ending; the codons CTT and ACT are the exceptions. The results of the chi-square test confirm that there is a bias towards G and T in the leading strand, and towards C and A in the lagging strand. Therefore, it could be concluded that in
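For readers who wish to reproduce this kind of comparison, RSCU values can be computed as sketched below. Because Equation 3 is not reproduced in this text, the sketch assumes the commonly used definition, in which the count of a codon is divided by the mean count of all synonymous codons for the same amino acid; it also assumes the standard genetic code, and the function name `rscu` is hypothetical. Methionine, tryptophan and stop codons are excluded, which leaves the 59 codons referred to above.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}

def rscu(coding_sequences):
    """RSCU per codon: observed count divided by the mean count of the
    synonymous codons of the same amino acid (assumed definition)."""
    counts = Counter()
    for seq in coding_sequences:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:      # skips codons with ambiguous bases
                counts[codon] += 1
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":                     # stop codons are not scored
            families[aa].append(codon)
    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if len(codons) < 2 or total == 0:  # Met, Trp, or unobserved family
            continue
        for c in codons:
            values[c] = counts[c] * len(codons) / total
    return values

if __name__ == "__main__":
    print(rscu(["GCTGCTGCTGCA"]))  # alanine: GCT used three times, GCA once
```

Computing `rscu` separately for the genes of the leading strand and those of the lagging strand gives the two sets of values that a test such as the one summarized in Table 3 would compare.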
2.4. The underlying mechanism for the composition bias in bacterial genomes
As mentioned above, almost all bacterial genomes have significant strand-specific composition biases, so it is important to investigate the mechanisms underlying them. Two published papers reviewed numerous explanations for the base composition bias in bacterial genomes (Frank and Lobry, 1999; Rocha, 2004). These hypotheses can be divided into two major categories (Necsulea and Lobry, 2007). The first hypothesis supposes that the replication mechanism is a direct cause of base composition asymmetry: different mutation frequencies on the two replicating strands result in the nucleotide composition bias (Powdel et al., 2009). The second hypothesis states that the deviations from parity rule 2 (PR2) are associated with the strand asymmetry of the transcription mechanism, in combination with the gene distribution bias encountered in bacterial chromosomes, where most protein-coding genes are located on the leading strands (Powdel et al., 2009). This theory also falls back on mutation bias for its detailed interpretation: during transcription, the template and non-template strands differ in their probabilities of mutation and subsequent repair. As for the main cause of these asymmetries, numerous authors have provided solid evidence in favour of the mutationist view by demonstrating that the base skews are mainly expressed at the third codon positions of genes and in non-coding regions, where selective pressure is minimal (Lobry, 1996). In both mutationist views, cytosine deamination in single-stranded DNA plays a vital role in generating strand composition bias. The deamination of cytosine leads to the formation of uracil. Because of Watson-Crick base pairing, cytosine is effectively protected against deamination under normal circumstances in vivo. However, the rate of cytosine deamination increases 140-fold in single-stranded DNA (Beletskii and Bhagwat, 1996).
If the resulting uracil is not replaced with cytosine, C -> T mutations occur. During replication, the leading strand spends more time in the single-stranded state than the lagging strand. Therefore, C to T mutations occur more frequently in the leading strand than in the lagging strand, producing excesses of G over C and T over A in the leading strand, and of C over G and A over T in the lagging strand. During transcription, the coding strand is more exposed in the single-stranded state and therefore also accumulates an excess of G over C.
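A small worked example makes the direction of the resulting skews explicit. The sketch below is a deliberately simplified illustration: it assumes a strand that starts with given base frequencies, converts a chosen fraction of its C into T in a single step, and ignores repair, back mutation and selection. The function name and the numerical values are hypothetical.

```python
def skews_after_deamination(freqs, fraction):
    """Convert `fraction` of C into T on one strand and return
    (GC skew, TA skew) = ((G - C)/(G + C), (T - A)/(T + A))."""
    a, c, g, t = (freqs[base] for base in "ACGT")
    converted = c * fraction
    c -= converted          # deaminated cytosines are lost from C ...
    t += converted          # ... and end up as T after replication
    return (g - c) / (g + c), (t - a) / (t + a)

if __name__ == "__main__":
    equal = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    gc_skew, ta_skew = skews_after_deamination(equal, fraction=0.1)
    print(f"GC skew = {gc_skew:.4f}, TA skew = {ta_skew:.4f}")
```

Starting from equal frequencies and converting a fraction d of C, the GC skew is d/(2 - d) and the TA skew is d/(2 + d), so both are positive on the more exposed strand. By complementarity, the opposite strand shows skews of the same magnitude and opposite sign.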
Extensive evidence has been presented to support the replication mechanism as a direct cause of base composition asymmetries (Necsulea and Lobry, 2007). As mentioned above, analyses of codon usage patterns, through correspondence analysis or other statistical methods, showed that in some bacterial species genes located on the two replicating strands can be distinguished by their synonymous codon choice (McInerney, 1998; Wei and Guo, 2010). Using ANOVA on GC and AT skews, with gene direction and replication orientation as the explanatory variables, Tillier and Collins (2000) showed that the nucleotide composition of a bacterial gene is significantly influenced by its position on the leading or the lagging strand of replication. Lobry and Sueoka (2002) performed a thorough analysis of 43 prokaryotic chromosomes and confirmed that deviations from parity rule 2 differ significantly between leading and lagging strands. This is one of the most convincing pieces of evidence. Worning et al. (2006) suggested that the sign of the AT-skew is determined by the polymerase alpha subunit that replicates the leading strand. In bacteria such as
The second hypothesis also has its supporting evidence. Francino et al. (1996) concluded that the substitution patterns were similar on the leading and lagging strands, but significantly different between the coding and non-coding strands, based on the observation of several genes in E. coli K12. Therefore, they suggested that a process linked to transcription rather than the mode of replication caused the nucleotide asymmetry. Note that a partly contradictory result was obtained by Rocha et al. (2006), at the whole genomic scale in the same species. According to them, the C to T substitution is much higher in leading strands than in lagging strands in
Based on the artificial genome rearrangement proposed by Nikolaou and Almirantis, Necsulea and Lobry (2007) developed a novel method to distinguish the effects of replication and transcription on base composition asymmetry. Their results suggested that the effect of replication on the GC-skew is generally very strong, whereas for numerous species the AT-skew is caused by coding sequence-related mechanisms. Therefore, the base composition bias in bacterial genomes would be the superposed effect of replication and transcription. The superposed effect of the two processes may be their sum or their difference. In other words, transcription-associated asymmetries can either increase or decrease replication-associated strand asymmetries, depending on the transcription direction and the position of the gene relative to the origin of replication (Necsulea and Lobry, 2007; Mugal et al., 2009). See also the chapter by Seligmann in this book.
2.5. Why does extremely strong strand composition bias exist in obligate intracellular parasites?
As mentioned above, 11 bacteria have been found to have extremely strong strand composition bias (Wei and Guo, 2010). The bias is strong enough to divide base and codon usages according to whether genes are located on the leading or lagging strands. Their names are
As reported in many cases, the living environment and lifestyle may influence the genomic G+C content and the codon usage of genes. Based on this consideration, we compared the habitats of the 11 bacteria. Nine of them are obligate intracellular parasites, which means that they live permanently inside the cells of their host. Owing to this protected habitat, their DNA would suffer less damage from ultraviolet radiation or other physical and chemical factors than that of free-living bacteria. Over long-term evolution, some or most of the genes coding for DNA repair enzymes may have been lost from these species. Owing to the loss of such genes or enzymes, mutations generated during replication are not effectively corrected. Replication-associated mutations would therefore accumulate much more in obligate intracellular parasites than in free-living bacteria. Such mutations might be a major cause of the strand composition bias in bacterial genomes: the more mutations, the stronger the bias. This deduction is our speculation, and its correctness should be validated by a large-scale test in the future.
Secondly, the chromosomes of the 11 bacteria are all shorter than 2000 kb. According to statistics on fully sequenced genomes, bacterial chromosomes vary from 160 kb to more than 10000 kb. However, all 11 species have small genomes, although some of these bacteria are not endosymbionts. Hence, we suppose that a small genome size is a necessary condition for generating a sufficiently strong strand-specific mutational bias. Perhaps in small bacterial genomes that have undergone reductive evolution, the repair mechanisms associated with replication are inefficient. Alternatively, in bacteria with larger chromosomes, it may be hard for mutation pressure to prevail over translational selection.
Thirdly, all of the 11 bacteria have medium or low genomic G+C content. Among them,
Fourthly, the strong mutation bias may be associated with the absence of certain genes involved in chromosome replication. As suggested by Klasson and Andersson (2006), the strong strand-specific mutational bias in endosymbiont genomes coincides with the absence of genes associated with replication restart. After a comparative analysis on 20 gamma-proteobacterial genomes, it was found that endosymbiont bacteria lacking
Finally, Figure 6 shows the y component curves of the Z curve defined in equation (2) for five representatives of the 11 bacteria with extremely strong strand composition bias. For comparison, the y component curve of the
translocations to another half of a chromosome, or integration of foreign DNA into the chromosome. Note that protection against mutations by secondary structure formation also explains such strand asymmetries (Krishnan et al., 2004). In other words, chromosome rearrangements often appear as small spikes in the y component curves. Therefore, we could conclude that the 11 bacterial chromosomes are highly stable and have very few rearrangements. According to Rocha (2004), a lower rearrangement frequency is the most likely reason for the appearance of separate codon usages in some obligate intracellular parasites. Our results are consistent with this speculation.
3. Strand composition bias in eukaryotes, organelles, archaea and plasmids
Compared with bacterial genomes, studies on strand composition bias in eukaryotic genomes are limited. Most analyses of eukaryotic genomes did not show strand compositional asymmetry at chromosome scale (Grigoriev, 1998; Gierlik et al., 2000). It is probably a result of a relative excess of autonomously replicating sequences (ARS) and of random choice of these sequences in each replication cycle (Gierlik et al., 2000). However, the examination of three contigs from human genomes gave some evidence of strand compositional asymmetries. In addition, local asymmetries have been found in the last ARS from both ends of chromosomes of
As for human genomes, Francino and Ochman (2000) failed to detect asymmetry in some replicons by phylogenetic comparisons. Analysis of the whole set of human genes revealed that most of them presented TA and GC skews (Touchon et al., 2003). The two kinds of bias are correlated with each other and are specific to gene sequences, exhibiting sharp transitions between transcribed and non-transcribed regions. At the same time, Green et al. (2003) described a qualitatively different transcription-associated strand asymmetry in humans. In their study, human orthologous sequences were generated by alignment with eight other mammals. The authors saw pronounced asymmetric transition substitutions in the transcribed regions of human chromosome 7. Transitions of A to G were 58% more frequent than T to C, and G to A transitions were 18% more frequent than C to T. With ‘maximal segment’ analysis, they showed that the strand asymmetry was associated specifically with transcribed regions. Two years later, Touchon et al. (2005) analyzed intergenic and transcribed regions flanking experimentally identified human replication origins and the corresponding homologous regions of mouse and dog. They demonstrated that compositional strand asymmetries associated with replication exist. By using wavelet transformations of skew profiles, the authors revealed the existence of 1000 putative replication origins associated with randomly distributed termination sites in the human genome (Touchon et al., 2005). Around these putative origins, the skew profile displayed a characteristic jagged pattern, which was also observed in the mouse and dog genomes. By analyzing the nucleotide composition of intergenic sequences larger than 50 kb with cumulative skew diagrams, Hou et al. (2006) found replication-associated strand asymmetry in vertebrates including humans.
Therefore, they proposed that transcription-associated strand asymmetries masked the replication-associated ones in the human genome. Huvet et al. (2007) found with multi-scale analysis that the base skew profile presented characteristic patterns consisting of successions of N-shaped structures in more than one-quarter of the human genome. These N domains are bordered by putative replication origins. Wang et al. (2008) illustrated that transcription-associated and replication-associated strand compositional asymmetries coexist in most large genes of vertebrates (including humans), although in most cases the former conceals the latter. The three most frequent types of asymmetric substitution, C to T, A to G, and G to T, were examined in the human genome (Mugal et al., 2009). All three rates were found to be on average higher on the coding strands than on the transcribed strands. This finding points to the simultaneous action of rate-increasing effects on the coding strands, such as increased adenine and cytosine deamination, and of transcription-coupled repair as a rate-reducing effect on the transcribed strands. Furthermore, the authors showed that the rate asymmetries of genes are to some extent also produced by the process of replication, depending on the distance to the next ORI and the relative direction of transcription and replication (Mugal et al., 2009). With the help of the very recently published work by Chen et al. (2011), we conclude that strand composition asymmetry (bias) in the human genome is the superposed effect of replication and transcription asymmetries. Of these, transcription-associated mutation and/or repair biases act on transcribed regions, whereas replication-induced mutation and repair biases act on the whole chromosome. This is quite similar to the situation in bacterial genomes.
As for eukaryotic organelles, there are quite a few reports of strand bias. For example, Seligmann and colleagues observed strand asymmetric gradients in various mitochondria and investigated in the past five years how properties of replication origins affect the gradients (Seligmann, 2010; Seligmann and Krishnan, 2006; Seligmann et al., 2006a, 2006b).
Regarding archaea, a few have shown significant strand composition skews, which are associated with replication. Among them, some are determined or predicted to contain a single replication origin, while others have multiple origins of replication, similar to eukaryotes. According to Necsulea and Lobry (2007), 18 out of 29 archaeal chromosomes showed significant effects of replication on nucleotide skews
Usually, it is believed that bacterial plasmids replicate using a different mechanism than that of the chromosome of their host cell. In 2000, cumulative skew diagrams showed that plasmid and chromosome of
4. Conclusion and future research
Strand composition bias has been found in various genomes over the past 20 years. The base composition bias in bacterial genomes is thought to be the superposed effect of replication-associated and transcription-associated asymmetries in mutation bias. In some species, the former mechanism is mainly responsible for the bias, while in others the latter constitutes the major driving force. In still others, the two mechanisms have equally important effects. Transcription-associated asymmetries can either increase or decrease replication-associated strand asymmetries, depending on the transcription direction and the position of the gene relative to the origin of replication. Theoretical prediction of replication origins is one of the practical applications of the universal phenomenon of strand composition bias in bacterial genomes. Future work should focus on the following aspects: (1) the characteristics and mechanisms of the biases that are common to prokaryotic and eukaryotic genomes; (2) the cause of the varying strength of composition bias in different bacterial genomes; (3) strand composition bias in eukaryotes other than Homo sapiens and in archaeal genomes.
The present study was supported by the National Natural Science Foundation of China (grant 60801058 and 31071109), and the Fundamental Research Fund for the Central Universities of China (grant ZYGX2009J082).