The number distribution of transmembrane helices is a genetic feature that reflects survival strategy, because the number of transmembrane helices in a membrane protein is closely related to its functional group: for example, most membrane proteins with six transmembrane helices belong to the transporter functional group. Survival strategies were acquired through evolutionary mechanisms that change genome sequences. Comparing the number distributions of transmembrane helices among species with different survival strategies therefore helps us to understand the evolutionary mechanism that has increased the categories of membrane proteins.
Several studies have used protein databases to examine how the categories of protein function increased during evolution (Chothia et al., 2003; Huynen & van Nimwegen, 1998; Koonin et al., 2002; Qian et al., 2001; Vogel et al., 2005). However, these studies dealt almost exclusively with soluble proteins. Proteins are often classified into functional groups by empirical methods such as sequence homology, which use the sequences of proteins with resolved three-dimensional structures as templates for each functional group. However, far fewer three-dimensional structures have been resolved for membrane proteins than for soluble proteins, because membrane proteins are experimentally difficult to handle.
In previous studies, we developed the membrane protein prediction system SOSUI and the signal peptide prediction system SOSUIsignal (Gomi et al., 2004; Hirokawa et al., 1998). By combining these systems, the number of transmembrane helices can be predicted from physicochemical rather than empirical parameters. It is therefore possible to investigate the number distribution of transmembrane regions in membrane proteins comprehensively across various genomes by using SOSUI and SOSUIsignal.
2. Membrane protein prediction systems
The SOSUI prediction software (Hirokawa et al., 1998; Mitaku et al., 2002) for transmembrane helix regions uses the physicochemical features of transmembrane helix segments. Transmembrane helix regions have three common features: (1) a hydrophobic core at the center of the helix segment; (2) amphiphilicity at both termini of each helix region; and (3) the length of the helix region. These features are essential for the transmembrane segment to be stably present in the cell membrane. The SOSUI system first enumerates candidate transmembrane regions by the average hydrophobicity of segments, and the candidates are then discriminated by the distributions of hydrophobicity and amphiphilicity around each candidate segment.
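The first stage described above, enumerating candidates by average hydrophobicity, can be illustrated with a sliding-window scan. This is a minimal sketch and not the SOSUI algorithm itself: the Kyte-Doolittle hydropathy scale, the 19-residue window, the 1.6 threshold and the function names are all assumptions chosen for illustration, and the second stage (discrimination by amphiphilicity) is omitted.

```python
# Kyte-Doolittle hydropathy index. Using this scale is an assumption made
# for illustration; the source does not state which scale SOSUI uses.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_hydropathy(seq, window=19):
    """Average hydropathy of every window of `window` consecutive residues."""
    scores = [KD[aa] for aa in seq]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


def candidate_segments(seq, window=19, threshold=1.6):
    """Return merged (start, end) spans, 0-based and end-exclusive, covered
    by windows whose average hydropathy reaches `threshold` (hypothetical
    cutoff, not a SOSUI parameter)."""
    spans = []
    for i, h in enumerate(window_hydropathy(seq, window)):
        if h >= threshold:
            if spans and i <= spans[-1][1]:
                # Window overlaps or touches the previous span: extend it.
                spans[-1] = (spans[-1][0], i + window)
            else:
                spans.append((i, i + window))
    return spans


# A hydrophobic stretch flanked by charged residues yields one candidate.
example = "DEKR" * 5 + "L" * 20 + "DEKR" * 5
print(candidate_segments(example))
```

A real predictor would go on to test each candidate against the other two features named in the text, amphiphilicity at the termini and segment length.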
SOSUIsignal (Gomi et al., 2004) predicts signal peptides, which are removed from proteins that are secreted to the extracellular space via the secretory process. Signal peptides are located at the amino-terminal segment of their respective proteins; the physicochemical features of this N-terminal structure are recognized by molecular modules during the cleavage process. The SOSUIsignal system is similar to the SOSUI system in that candidates are first enumerated by the average hydrophobicity of the amino-terminal region, and real signal peptides are then discriminated by several parameters.
Because they focus on these physicochemical features, the prediction systems are highly accurate: approximately 95% for SOSUI and 90% for SOSUIsignal. With this software, we can analyze not only protein sequences of unknown function but also simulated sequences.
3. Typical number distribution of transmembrane regions in membrane proteins
We investigated the populations of the number groups of transmembrane helices for 557 prokaryotic genomes using SOSUI and SOSUIsignal. Figure 1 shows the results for the membrane proteins encoded in the E. coli genome as a typical example; the largest category of membrane proteins comprised proteins with only one transmembrane-spanning helix, and the second largest category comprised proteins with two transmembrane-spanning helices. The population of each category decreased gradually up to 4 transmembrane helices, and there was then a plateau from 4 to 13 helices. The population decreased rapidly for categories with more than 13 transmembrane helices, and there were apparently no proteins with more than 16 transmembrane helices. These results indicate that membrane proteins with a particular number of transmembrane helices, such as 12, are important for E. coli.
4. Variety of number distribution of transmembrane regions among organisms
The general trend of the number distribution of transmembrane helices was very similar among the 557 prokaryotic genomes, but the fine structure of the distribution can change during evolution. The four graphs in Fig. 2 show the results for four prokaryotic genomes: A, Pyrobaculum calidifontis; B, Thermoplasma volcanium; C, Pseudomonas putida; and D, Thermotoga petrophila. We selected these four organisms to show how the number distributions of transmembrane helices differ among organisms. The distribution for P. calidifontis did not show the significant shoulder at 13 helices seen for E. coli. The shape of the distribution for T. volcanium was very similar to that for E. coli, although the population was much smaller. A significant peak was observed at 12 helices for P. putida, and a peak at 6 helices was observable for T. petrophila. Despite these differences among organisms, the general trend of the distribution suggests the existence of a target distribution.
5. Number distribution of transmembrane helices in proteins in organisms grouped by GC contents
If the differences in the number distribution are due to fluctuation around a target distribution, they should decrease when the distributions of many organisms are averaged. In contrast, if the differences are due to some systematic change among organisms, they would not disappear through simple averaging. The GC content of genomes differs widely among species, from 0.3 to 0.7, and it is well known that various characteristics of prokaryotic cells change systematically with GC content. Therefore, we investigated whether the distribution of the number of transmembrane helices per protein changes according to GC content. The genomes of 557 prokaryotes were classified into nine groups with different GC content. In Fig. 3, the average number distributions of transmembrane helices in the nine groups indicate that the distributions were unchanged despite the differences in GC content. The membrane-protein profiles of the nine groups shared a common feature in that the general shapes of the curves were the same: the population of each category of membrane protein decreased gradually, and there was a shoulder at the category with 12 transmembrane helices. This result strongly suggests that the differences in the fine structure of the number distribution are due to fluctuation around the general curve of Fig. 3. A question then arises about natural selection: is the general curve formed by the pressure of natural selection?
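The grouping and averaging procedure can be sketched as follows. The source does not say how the nine GC groups were delimited, so the equal-width bins between 0.3 and 0.7 used here are an assumption, as are the function names and the input layout (one GC value and one list of predicted helix counts per genome).

```python
from collections import defaultdict


def gc_content(dna):
    """Fraction of G and C among the A, C, G and T bases of a sequence."""
    dna = dna.upper()
    gc = dna.count("G") + dna.count("C")
    at = dna.count("A") + dna.count("T")
    return gc / (gc + at)


def gc_bin(gc, lo=0.3, hi=0.7, n_bins=9):
    """Index (0 .. n_bins-1) of an equal-width GC bin; values outside
    [lo, hi] are clamped. Equal-width bins are an assumption."""
    idx = int((gc - lo) / (hi - lo) * n_bins)
    return min(max(idx, 0), n_bins - 1)


def average_by_gc(genomes, max_helices=16):
    """genomes: iterable of (gc, helix_counts), where helix_counts lists the
    predicted number of transmembrane helices of each membrane protein.
    Returns {bin: averages}, where averages[k] is the mean number of
    proteins with k helices per genome in that bin."""
    sums = defaultdict(lambda: [0] * (max_helices + 1))
    n_genomes = defaultdict(int)
    for gc, helix_counts in genomes:
        b = gc_bin(gc)
        n_genomes[b] += 1
        for k in helix_counts:
            if 1 <= k <= max_helices:
                sums[b][k] += 1
    return {b: [s / n_genomes[b] for s in sums[b]] for b in n_genomes}


# Two toy genomes that fall into the same GC group.
toy = [(0.50, [1, 1, 2, 12]), (0.51, [1, 2, 2])]
print(average_by_gc(toy))
```

If the differences among organisms are fluctuations, the averaged curves of the groups should resemble one another, which is the comparison made in Fig. 3.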
6. Random sequence simulation
Presumably, functionally important proteins are maintained in biological genomes by natural selection. Therefore, the general curve of the number distribution of transmembrane helices must reflect the natural selection that occurs during biological evolution. The prediction systems SOSUI and SOSUIsignal have the great advantage that they are applicable to any amino acid sequence, independent of empirical information, because they are based mainly on the physicochemical parameters of amino acids. We therefore used the prediction systems to compare the number distributions of transmembrane helices between real genomes and simulated genomes in which comprehensive mutations are introduced without any pressure of natural selection. In this way, we investigated the effect of random mutation uncoupled from natural selection on the number distribution of transmembrane helices using random sequence simulations. The E. coli genome was used for the random sequence simulation. At each simulation step, one in every 100 amino acids in all protein sequences was mutated randomly. The new amino acids were chosen according to the amino acid composition of the E. coli genome. The number distributions of transmembrane helices were estimated with SOSUI and SOSUIsignal after each simulation step. The simulation was iterated for 500 steps.
The distributions of transmembrane helices in membrane proteins for the simulated genomes are shown in Fig. 4. As the simulation proceeded, the number of membrane proteins with more than six transmembrane helices decreased monotonically and the shoulder in the distribution around 12 transmembrane helices disappeared. Beyond 300 simulation steps, by which point the sequences were completely randomized, the distribution became very similar to a single exponential decay. The broken line in Fig. 4 represents the single exponential decay curve, $y = 2090\,e^{-0.87x}$, which was obtained by least-squares analysis of the averaged distribution between 300 and 500 steps.
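One mutation step of such a simulation might look like the sketch below. It assumes that "one in every 100 amino acids" means replacing 1% of all positions in the proteome per step, with positions chosen uniformly at random and replacement residues drawn from the composition of the original proteome; a replacement may coincide with the residue it replaces. The function names are hypothetical, and the prediction step with SOSUI and SOSUIsignal is not reproduced here.

```python
import random
from collections import Counter


def composition(proteins):
    """Residues and their counts over all protein sequences."""
    counts = Counter("".join(proteins))
    residues = sorted(counts)
    return residues, [counts[r] for r in residues]


def mutate_step(proteins, residues, weights, rate=0.01, rng=random):
    """Replace a fraction `rate` of all positions with residues drawn
    according to `weights`. Protein lengths are preserved."""
    lengths = [len(p) for p in proteins]
    total = sum(lengths)
    flat = list("".join(proteins))
    for pos in rng.sample(range(total), round(total * rate)):
        flat[pos] = rng.choices(residues, weights=weights)[0]
    # Cut the mutated proteome back into the original protein lengths.
    mutated, start = [], 0
    for n in lengths:
        mutated.append("".join(flat[start:start + n]))
        start += n
    return mutated


# Toy proteome; the composition is taken once, from the unmutated sequences.
proteome = ["A" * 300, "L" * 700]
residues, weights = composition(proteome)
rng = random.Random(0)
for step in range(500):
    proteome = mutate_step(proteome, residues, weights, rng=rng)
    # A real run would predict transmembrane helices for `proteome` here.
```

Computing the composition once from the unmutated sequences keeps the target composition fixed over all steps, which is one reading of the procedure described in the text.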
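A single exponential can be fitted to such a distribution by ordinary least squares on the logarithm of the counts. The source does not say whether the fit was done in log space or by nonlinear least squares, so the log-linear approach and the function name below are assumptions; the input is synthetic data generated from the reported curve, not the simulation results.

```python
import math


def fit_exponential(xs, ys):
    """Fit y = a * exp(-b * x) by linear least squares on log(y).
    Returns (a, b). All y values must be positive."""
    logs = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    slope = (sum((x - mean_x) * (l - mean_log) for x, l in zip(xs, logs))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_log - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic check: points generated from the reported curve are recovered.
xs = list(range(1, 11))
ys = [2090 * math.exp(-0.87 * x) for x in xs]
a, b = fit_exponential(xs, ys)
print(round(a), round(b, 2))
```

Fitting in log space gives the small counts at high helix numbers the same weight as the large counts at low helix numbers, which may or may not match the weighting used for Fig. 4.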
After 300 simulation steps, the shapes of the number distributions of transmembrane helices for the simulated genomes were almost unchanged in spite of additional mutations. The single exponential distribution for the simulated genomes can be explained by a kind of reaction on the evolutionary time scale that changes the number of membrane proteins through extensive mutations:
$$\mathrm{TM}_{i-1} \underset{k_-}{\overset{k_+}{\rightleftharpoons}} \mathrm{TM}_{i}$$
in which $\mathrm{TM}_i$ represents a membrane protein with $i$ transmembrane helices. If the equilibrium constant is the same among the distinct equilibrium states, the shape becomes exponential, as follows:
$$\langle \mathrm{TM}_n \rangle = \frac{k_+}{k_-}\,\langle \mathrm{TM}_{n-1} \rangle = \left(\frac{k_+}{k_-}\right)^{n} \langle \mathrm{TM}_0 \rangle$$
where $\langle \mathrm{TM}_n \rangle$, $\langle \mathrm{TM}_{n-1} \rangle$ and $\langle \mathrm{TM}_0 \rangle$ represent the populations of membrane proteins with $n$, $n-1$ and 0 helices, respectively, and $k_+/k_-$ is the equilibrium constant. In the simulation, the equilibrium constants for all transmembrane helix number groups are the same because of the algorithm of the prediction systems, and the single exponential decay in the computer experiment is well interpreted by this model. However, in the real genome, the shape of the distribution is not exponential and shows a significant plateau and shoulder. This indicates that the equilibrium constants for the transmembrane helix number groups are not the same, which may be due to differences in functional importance among the membrane protein groups.
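As a worked illustration of this model, a constant equilibrium constant K = k+/k- makes each population K times the previous one, so the fitted decay rate of 0.87 per helix corresponds to K = e^(-0.87), roughly 0.42. The sketch below only restates that arithmetic. Treating 2090 as the population extrapolated to zero helices follows the fitted curve and is not a measured count of helix-free proteins; the function name is hypothetical.

```python
import math


def equilibrium_populations(tm0, K, n_max):
    """Populations <TM_n> = K**n * <TM_0> for n = 0 .. n_max, assuming the
    same equilibrium constant K between all neighbouring helix numbers."""
    return [tm0 * K ** n for n in range(n_max + 1)]


K = math.exp(-0.87)  # equilibrium constant implied by the fitted decay rate
populations = equilibrium_populations(2090, K, 16)

for n, p in enumerate(populations):
    print(n, round(p, 1))
```

A plateau or shoulder such as that of the real genomes cannot be produced by any single value of K, which is the argument the text makes for unequal equilibrium constants.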