Estimating Human Point Mutation Rates from Codon Substitution Rates


Introduction
Estimation of point mutation rates is essential for studying molecular evolution and genetics. Point mutation rates are also important for developing tools for genome analyses, such as those used for homology searches (Altschul et al., 1990), sequence alignments (Katoh et al., 2002; Larkin et al., 2007), gene finding (Misawa and Kikuno, 2010), or detecting natural selection (Nei and Gojobori, 1986; Hughes and Nei, 1988; Yang, 2007; Yang and Nielsen, 2008), and for reconstructing phylogenetic trees (Felsenstein, 2004; Sullivan and Joyce, 2005). Patterns of mutations also affect the neutrality test for population genetics (Misawa and Tajima, 1997). According to the neutral theory (Kimura, 1968), new alleles may be produced at the same rate per individual as they are substituted in a population. On the basis of this theory, mutation rates were estimated from neutral substitution rates.
One cause of mutations is error during DNA replication (Pray, 2008). Since cell division is tightly linked to DNA replication, mutation rates are expected to correlate with the number of cell divisions, so that they are higher in sperm than in eggs (Haldane, 1956; Miyata et al., 1987). The phenomenon in which mutation rates are higher in males than in females is called 'male-driven evolution' (Miyata et al., 1987). Previous studies show large discrepancies with regard to the effect of male-driven evolution on mutation rates (Li et al., 2002). To investigate the effect of DNA replication on mutation rates, the effect of male-driven evolution was also investigated in this study.
Recent studies on mutation rates in human non-coding regions have shown that mutation rates in the human genome are negatively correlated with local GC content (Fryxell and Moon, 2005; Taylor et al., 2006; Tyekucheva et al., 2008; Walser et al., 2008) and with the densities of functional elements (Hardison et al., 2003; Hellmann et al., 2005; Tyekucheva et al., 2008). Mutation rates have also been shown to correlate with the distance of a gene from the telomere (Hellmann et al., 2005; Tyekucheva et al., 2008). A study on mutation rates in human coding regions showed that mutation rates also depend on chromosome size, which is probably attributable to the distance between the genes and telomeres (Misawa, 2011).
Mutation rates sometimes depend on the local context, especially the adjacent nucleotides (Cooper and Youssoufian, 1988; Cooper and Krawczak, 1989; Hobolth et al., 2006; Misawa et al., 2008; Misawa and Kikuno, 2009; Misawa, 2011). CpG hypermutability is a major cause of codon substitution in mammalian genes (Huttley, 2004; Lunter and Hein, 2004; Misawa et al., 2008; Misawa and Kikuno, 2009). CpG hypermutation occurs approximately 10 or more times faster than other types of point mutation (Scarano et al., 1967; Bird, 1980; Lunter and Hein, 2004). Figure 1 shows cytosine (C), methylcytosine (methyl-C), and thymine (T). The notation CpG is used to distinguish a C followed by a guanine (G) from a Watson-Crick pair of C and G. CpG dinucleotides are often methylated at C by DNA methyltransferase (DNMT) (Wu and Zhang, 2011), and methyl-C spontaneously undergoes deamination to generate T. The mutation pressure of CpG hypermutability is so high in organisms with DNMT that they share a similar pattern of amino acid substitutions (Misawa et al., 2008). This pattern of amino acid substitutions used to be called the 'universal' trend (Jordan et al., 2005). Misawa et al. (2008) also showed that organisms that have lost DNMT, such as Buchnera and Saccharomyces, do not share the 'universal' trend. In the mouse, the effect of CpG hypermutability on codon preference is stronger than that of tRNA abundance (Misawa and Kikuno, 2011). When mutation rates are estimated, the effect of adjacent nucleotides on mutation rates and the direction of mutations should be considered. Some previous studies used mutation models that do not consider the effect of adjacent nucleotides: the REV model (Tavare, 1986; Yang, 1994; Whelan et al., 2001) used by Hardison et al. (2003), and the JC (Jukes and Cantor, 1969) and HKY (Hasegawa et al., 1985) models used by Tyekucheva et al. (2008).
The study by Taylor et al. (2006) was based on a two-species comparison, and therefore the direction of mutation was unclear. Recently, my colleagues and I (Misawa and Kikuno, 2009; Misawa, 2011) estimated the mutation rates in humans by considering the effect of adjacent nucleotides on mutation rates and the direction of mutations. These studies, however, did not consider the effect of the distance of a gene from the telomere on mutation rates.
To understand the effect of DNA replication and of the distance of a nucleotide site from the telomere, mutation rates were estimated from the codon substitution rates in the coding regions of thousands of human and chimpanzee genes on the autosomes and the X chromosome; the ancestral gene sequences were inferred by using macaque genes as the outgroup. Regression analyses were conducted to evaluate the effect of GC content, gene density, and CpG island density on the rates of CpG-to-TpG mutations, TpG-to-CpG mutations, and non-CpG transitions and transversions.

Data set
To estimate mutation rates, 10,372 orthologous gene trios obtained from the human, chimpanzee, and macaque genomes (Gibbs et al., 2007) were used. Table 1 shows the sizes of the human chromosomes in the NCBI human genome (Build 36), taken from the UCSC genome browser (http://genome.ucsc.edu). Table 1 also shows the number of genes on each chromosome used in this study. These genes were binned into a series of 10-Mb windows of human DNA according to the position of the midpoint of each gene.
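The binning step can be sketched in a few lines of Python. This is only an illustrative sketch, not the program used in the study: the function names, the tuple layout of a gene record, and the use of 0-based base-pair coordinates with integer midpoints are all assumptions made here.

```python
WINDOW_SIZE = 10_000_000  # 10 Mb, as in the text


def window_index(gene_start, gene_end, window_size=WINDOW_SIZE):
    """Return the index of the window containing the gene midpoint.

    Coordinates are assumed to be 0-based base-pair positions (hypothetical
    convention); the midpoint is rounded down to an integer.
    """
    midpoint = (gene_start + gene_end) // 2
    return midpoint // window_size


def bin_genes(genes, window_size=WINDOW_SIZE):
    """Group genes by (chromosome, window index).

    `genes` is an iterable of (name, chromosome, start, end) tuples,
    a record layout assumed for this sketch.
    """
    bins = {}
    for name, chrom, start, end in genes:
        key = (chrom, window_index(start, end, window_size))
        bins.setdefault(key, []).append(name)
    return bins
```

A gene that straddles a window boundary is assigned to exactly one window, the one containing its midpoint, so no gene is counted twice.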

Estimation of mutation rates from substitution rate on fourfold sites
To estimate mutation rates from nucleotide substitution rates, it is important to separate substitutions representing neutral evolutionary drift from those influenced by selection. Following Hardison et al. (2003), fourfold sites were analyzed in this study. Fourfold sites are the sites marked "N" in the codons GCN (Ala), CCN (Pro), TCN (Ser), ACN (Thr), CGN (Arg), GGN (Gly), CTN (Leu), and GTN (Val). It is worth noting that nucleotide substitutions at fourfold sites do not change the encoded amino acid, so they can be considered neutral. In the case of the mouse, the effect of codon preference on nucleotide substitutions would be smaller than that of mutations (Misawa and Kikuno, 2011). In this study, the mutation rates were estimated from the substitution rates at fourfold sites by assuming that the substitutions at fourfold sites are neutral.
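As a minimal illustration of this definition, the following Python sketch locates the fourfold sites in an in-frame coding sequence. The eight codon prefixes come directly from the list above; the function names and the 0-based indexing are assumptions of this sketch.

```python
# First two nucleotides of the eight fourfold-degenerate codon families
# listed in the text: GCN, CCN, TCN, ACN, CGN, GGN, CTN, GTN.
FOURFOLD_PREFIXES = {"GC", "CC", "TC", "AC", "CG", "GG", "CT", "GT"}


def is_fourfold_codon(codon):
    """True if the third position of this codon is a fourfold site."""
    codon = codon.upper()
    return (
        len(codon) == 3
        and codon[:2] in FOURFOLD_PREFIXES
        and codon[2] in "ACGT"
    )


def fourfold_positions(cds):
    """Return 0-based indices of fourfold sites in an in-frame sequence."""
    positions = []
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        if is_fourfold_codon(cds[i:i + 3]):
            positions.append(i + 2)
    return positions
```

For example, in the sequence ATG GCT TTT CGA only the third positions of GCT and CGA qualify.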

Classification of sites and mutations
To evaluate the effect of CpG hypermutability on mutation rates, fourfold sites were classified into three categories depending on the adjacent nucleotides, namely, CpG sites, TpG sites, and usual sites. If the nucleotide at a fourfold site is C and the first nucleotide of the 3'-adjacent codon is G, the site is classified as a CpG site. If the nucleotide at a fourfold site is T and the first nucleotide of the 3'-adjacent codon is G, the site is classified as a TpG site. If the nucleotide at a fourfold site is C and the first nucleotide of the 3'-adjacent codon is A, the site is also classified as a TpG site, because on the complementary strand T is next to G. Sites that are neither CpG sites nor TpG sites are classified as usual sites.
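The three classification rules stated above can be written as a small Python function. This sketch encodes only those three rules; the function name and the category labels are chosen here for illustration.

```python
def classify_site(site_nt, next_nt):
    """Classify a fourfold site from its own nucleotide and the first
    nucleotide of the 3'-adjacent codon, using only the three rules
    given in the text."""
    pair = (site_nt + next_nt).upper()
    if pair == "CG":
        return "CpG"
    if pair in ("TG", "CA"):  # CpA reads as TpG on the complementary strand
        return "TpG"
    return "usual"
```

For instance, `classify_site("C", "A")` returns `"TpG"` because 5'-CA-3' pairs with 5'-TG-3' on the opposite strand.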
Mutations were classified into 4 categories: CpG to TpG mutations, TpG to CpG mutations, and non-CpG transitions and transversions. To distinguish CpG to TpG mutations and TpG to CpG mutations from non-CpG transitions, the adjacent nucleotides were also considered.
If the observed nucleotide was T, its ancestral nucleotide was C, and the downstream nucleotide was G, the mutation was classified as a CpG to TpG mutation. For example, a mutation from CCC (Pro) to CCT (Pro) is classified as a CpG to TpG mutation only when the 3'-adjacent nucleotide of the codon is G. If the observed nucleotide was A, its ancestral nucleotide was G, and the upstream nucleotide was C, the mutation was again classified as a CpG to TpG mutation, because a CpG to TpG mutation occurs on the complementary strand of DNA. For example, a mutation from CCG (Pro) to CCA (Pro) corresponds to a CpG to TpG mutation. If the observed nucleotide was C, its ancestral nucleotide was T, and the downstream nucleotide was G, the mutation was classified as a TpG to CpG mutation. If the observed nucleotide was G, its ancestral nucleotide was A, and the upstream nucleotide was C, the mutation was again classified as a TpG to CpG mutation. Other mutations were classified into two types: transitions and transversions (Misawa et al., 2008; Misawa and Kikuno, 2009; Misawa, 2011; Misawa and Kikuno, 2011).
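A compact way to express these rules is a function of the ancestral nucleotide, the observed nucleotide, and the two flanking nucleotides. The Python sketch below is an illustration under stated assumptions, not the author's awk program: the function name and labels are hypothetical, and a T-to-C change before G (or an A-to-G change after C, which is the same event read on the complementary strand) is treated as TpG to CpG.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def classify_mutation(ancestral, observed, upstream, downstream):
    """Classify a single-nucleotide change at a fourfold site.

    `upstream` is the second nucleotide of the same codon and `downstream`
    is the first nucleotide of the 3'-adjacent codon.
    """
    if ancestral == observed:
        raise ValueError("no mutation: ancestral equals observed")
    # CpG -> TpG on this strand, or on the complementary strand (CpG -> CpA).
    if ancestral == "C" and observed == "T" and downstream == "G":
        return "CpG_to_TpG"
    if ancestral == "G" and observed == "A" and upstream == "C":
        return "CpG_to_TpG"
    # TpG -> CpG on this strand, or on the complementary strand (CpA -> CpG).
    if ancestral == "T" and observed == "C" and downstream == "G":
        return "TpG_to_CpG"
    if ancestral == "A" and observed == "G" and upstream == "C":
        return "TpG_to_CpG"
    if (ancestral, observed) in TRANSITIONS:
        return "transition"
    return "transversion"
```

With this function, the CCG to CCA example from the text is `classify_mutation("G", "A", "C", next_nt)`, which is a CpG to TpG mutation regardless of `next_nt`.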

Estimation of mutation rates by using the Maximum Parsimony (MP) method
I determined the codon sequences of the common ancestors of humans and chimpanzees by using the maximum parsimony (MP) method, with macaque genes as the outgroup. Next, I counted the number of codon substitutions that had occurred along the human lineage. For some codon trios, the ancestral state of the human and chimpanzee codons was ambiguous when estimated by the MP method. In such cases, all possible ancestral states were treated equally. I then calculated the mutation rates by dividing the number of codon substitutions per year by the number of ancestral codons. It was assumed that the human-chimpanzee divergence occurred 5 million years (MY) ago (Horai et al., 1995). To calculate confidence intervals of the estimates, the binomial distribution was assumed. The awk program used in the analysis is available from the author upon request.
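The procedure can be illustrated with a short Python sketch. It is not the author's awk program. Two points are assumptions made here: ambiguous trios (all three states different) are resolved by giving each of the three observed states equal weight, and the confidence interval uses the normal approximation to the binomial distribution, since the text does not say which interval was computed. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def ancestral_candidates(human, chimp, macaque):
    """Parsimony states of the human-chimpanzee ancestor, with weights.

    Returns a dict mapping each candidate state to its weight. When the
    three states all differ, the three observed states are equally
    parsimonious and are weighted equally (assumption of this sketch).
    """
    if human == chimp:
        return {human: 1.0}
    if macaque in (human, chimp):
        return {macaque: 1.0}
    return {state: 1.0 / 3.0 for state in (human, chimp, macaque)}


def rate_with_ci(n_subst, n_sites, years=5e6, level=0.99):
    """Rate per site per year with a normal-approximation binomial CI.

    `years` defaults to the 5-MY divergence time assumed in the text.
    """
    if n_sites <= 0:
        raise ValueError("n_sites must be positive")
    p = n_subst / n_sites
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * math.sqrt(p * (1.0 - p) / n_sites)
    lower = max(p - half_width, 0.0) / years
    upper = (p + half_width) / years
    return p / years, lower, upper
```

As a purely illustrative input, 50 substitutions at 10,000 ancestral sites give a proportion of 0.005 over 5 MY, that is, about 1 substitution per site per billion years.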

Comparison between mutation rate and the distance from the telomere by using regression analysis
I compared the mutation rates with the distance from the telomere to the central position of each 10-Mb window. These values were used in regression analyses, which were performed using the statistical software R (R Development Core Team, 2008).
Table 2 shows the mean values of the mutation rate estimates per site per billion years (BY) and their 99% confidence intervals. The rate of CpG to TpG mutation is similar to that obtained in the previous study, but the rates of the other types of mutation are lower than those obtained in the previous study (Misawa, 2011). The rate of CpG to TpG mutation is about 10 times higher than those of transitions and transversions at usual sites. This ratio is similar to those reported in previous studies (Scarano et al., 1967; Bird, 1980; Lunter and Hein, 2004). The mutation rates on the autosomes were similar to those on the X chromosome, except for the rate of transversion at usual sites, which was significantly higher on the autosomes than on the X chromosome after the Bonferroni correction (P < 0.05).
Figure 3 shows a scatter plot of the mutation rates on TpG sites; the upper panel is for the autosomes and the lower panel is for the X chromosome. Figure 4 shows a scatter plot of the mutation rates on usual sites. These figures show that mutation rates are negatively correlated with the distances from telomeres. They also show that a large variation in mutation rates exists among genomic regions. Table 3 shows Pearson's correlation coefficients between mutation rates and the distances from telomeres. Only the rates of TpG to CpG mutation and of transversion at TpG sites were significantly correlated with the distance from telomeres (P < 0.001 and P < 0.01, respectively) after the Bonferroni correction. There were no significant differences in the correlation coefficients between the autosomes and the X chromosome.
Table 3. Correlation coefficients between mutation rates and distances to telomeres by regression analysis
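The analysis in this study was run in R; purely as an illustration of the two statistical ingredients named above, the following Python sketch computes Pearson's correlation coefficient between per-window rates and telomere distances and applies a Bonferroni adjustment to a list of P values. The function names are hypothetical, and the P values themselves are assumed to come from a separate test, which is not shown here.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def bonferroni(p_values):
    """Bonferroni-adjusted P values: multiply by the number of tests
    and cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

Here `xs` would hold the distance of each window centre from the telomere and `ys` the corresponding mutation rate estimates; a negative coefficient corresponds to rates that fall with increasing distance.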

Discussion
In this study, the rates of CpG to TpG mutations, TpG to CpG mutations, and non-CpG transitions and transversions were estimated by comparing the coding regions of thousands of human and chimpanzee genes from the entire genome and inferring their ancestral sequences by using macaque genes as the outgroup. The rate of CpG to TpG mutation is about 10 times higher than those of transitions and transversions at usual sites. This ratio is similar to those reported in previous studies (Scarano et al., 1967; Bird, 1980; Lunter and Hein, 2004). The rate of CpG to TpG mutation is similar to that obtained in the previous study, but the rates of the other types of mutation are lower than those obtained in the previous study (Misawa, 2011), probably because the previous study included nonsynonymous substitutions (Misawa, 2011), whereas only fourfold sites were analyzed in this study. As seen in Table 2, no significant difference was observed between the mutation rates of the autosomes and the X chromosome, except for the rate of transversion at usual sites. This result indicates that the effect of 'male-driven evolution' (Miyata et al., 1987) is not strong. Figures 2, 3, and 4 show a large variation in mutation rates among genomic regions. This variation might arise because mutation rates are affected by various factors, such as gene density, GC content, and the density of CpG islands (Misawa, 2011).
Vogel and Motulsky (1997) pointed out that since the deamination of methyl-C occurs spontaneously and is independent of DNA replication, the rate of CpG mutations should scale with time and not with the number of cell divisions. Recently, Taylor et al. (2006) investigated male mutation bias separately at non-CpG and CpG sites by using human-chimpanzee whole-genome alignments. They concluded that CpG hypermutation is weakly affected by the number of cell divisions. As pointed out by Misawa (2011), the effect of male-driven evolution on CpG hypermutation is weaker than that of other chromosomal properties. Further study is necessary.
Figures 2, 3, and 4 indicate that the CpG to TpG substitution rates were negatively correlated with the distances from telomeres. This is consistent with previous studies (Hellmann et al., 2005; Tyekucheva et al., 2008), although their methods for estimating mutation rates differed from those of this study. However, Table 3 shows that only the rates of TpG to CpG mutation and of transversion at TpG sites were significantly correlated with the distance from telomeres after the Bonferroni correction. Tyekucheva et al. (2008) suggested the existence of additional mutagenic mechanisms that increase neutral substitution rates in subtelomeric regions. Increased divergence near telomeres has been linked to direct and indirect effects of large-scale chromosomal structure. If the correlation coefficients between mutation rates and the distances from telomeres on the X chromosome differed from those on the autosomes, cell division and DNA replication might be a part of such mutagenic mechanisms. Unfortunately, no significant differences in the correlation coefficients between the autosomes and the X chromosome were observed in this study.
The number of data points is not very large; thus, dividing the data into too many bins by the distance from the telomeres may yield weak results. As more data become available, incorporating these additional predictors in the regression analyses may be beneficial.