Different DNA bases and protein amino acids between FAM75 (unnamed protein product) and FAM205A in the NCBI records.
Little is known about protein sequences unique to humans. Here, we performed alignment-free sequence comparisons based on the availability (frequency bias) of short constituent amino acid (aa) sequences (SCSs) in proteins to search for human-specific proteins. Focusing on 5-aa SCSs (pentats), exhaustive comparisons of availability scores between the human proteome and nine other mammalian proteomes in the nonredundant (nr) database identified a candidate protein containing WRWSH, here called FAM75, as human-specific. Examination of various human genome sequences revealed that FAM75 had genomic DNA sequences for either WRWSH or WRWSR due to a single nucleotide polymorphism (SNP). FAM75 and its related protein FAM205A were found to be produced through alternative splicing. The FAM75 transcript was found only in humans, but the FAM205A transcript was also present in other mammals. In humans, both FAM75 and FAM205A were expressed specifically in testis at the mRNA level, and they were immunohistochemically located in cells in seminiferous ducts and in acrosomes in spermatids at the protein level, suggesting their possible function in sperm development and fertilization. This study highlights a practical application of SCS-based methods for protein searches and suggests possible contributions of SNP variants and alternative splicing of FAM75 to human evolution.
- availability score
- short constituent sequence (SCS)
- alternative splicing
- single nucleotide polymorphism (SNP)
- human genome
- human proteome
The human species has unique traits among animals. It is well known that morphological and physiological traits such as erect bipedalism, speech and language, and a long reproductive period are very different from those of other primate species. Only humans have high intelligence that fosters sophisticated communications and complex societies. This intelligence is related to continuous brain development after birth in humans, which is not observed in other great apes, including chimpanzees. The evolutionary emergence of these unique traits in humans likely contributes to human speciation. The simplest hypothesis to explain human uniqueness is that it originates from the uniqueness of constituent molecules (i.e., genes and proteins) themselves. In this “constituent hypothesis,” humans have unique genes and proteins that do not exist in chimpanzees. A contrasting hypothesis is that constituent molecules are similar between humans and chimpanzees, but they are regulated differently in these species. That is, in this “regulatory hypothesis,” a similar set of proteins may be produced but at different times (heterochrony), in different locations (heterotopy), in different amounts (heterometry), and in different usage (heterotypy). These regulatory changes in gene expression seem to be evolutionarily parsimonious and, indeed, are supported by comparative observations at phenotypic levels.
One line of support for the regulatory hypothesis comes from genomics and developmental expression studies. Following the release of the human genome, the genomes of other great apes were sequenced [5, 6, 7]. Comparisons of DNA sequences between humans and chimpanzees have revealed that nucleotide differences amount to only 1.23% in aligned sequences, and most of these differences are thought to be functionally insignificant. Further rigorous comparisons throughout these genomes have revealed that nucleotide differences amount to 4% and that they are mostly located in noncoding regions. The expression patterns of some genes are different between humans and chimpanzees during development [9, 10, 11, 12]. Differences in transcriptomes have revealed that species differences in expression patterns are tissue-dependent and that testes have the greatest difference [13, 14]. It has been speculated that the accumulation of small expression or regulatory differences leads to large phenotypic differences between humans and chimpanzees. On the other hand, while these findings support the regulatory hypothesis, they do not necessarily reject the constituent hypothesis [15, 16]. RNA-mediated mechanisms for novel genes have been proposed together with the “out of the testis” hypothesis, in which testis is considered a tissue for experimenting with new genes. Comparisons among transcriptomes in primates have revealed that many genes for spermatogenesis in testes, which likely inhibit apoptosis when mutated, are positively selected [17, 18].
Although these genome comparison studies advance this field, there are a few inherent problems. First, their results are heavily dependent on database quality because of their methodological nature. Most genome sequences were draft sequences at the time of public release, likely containing numerous sequencing and assembly errors. For example, the previous chimpanzee genome was assembled in reference to the human genome, which means that genomic regions in chimpanzees that are different from those in the human genome may have been assembled to create false sequences, although continuous revisions have been made. Even in the human genome, many previous gene records generated by automated assemblers have been removed after revisions. Moreover, population sampling bias from the sequenced genome cannot be avoided when samples from a small number of individuals are sequenced. The case of a transcription factor, FOXP2 (forkhead box P2), is an object lesson: FOXP2 has been proposed to have played a key role in human-specific evolution by assisting speech and language, but that evidence is likely to be weak and probably incorrect because of sampling bias.
Second, such genome comparisons are largely based on sequence alignments [22, 23]. Although sequence alignment methods are powerful and probably the most important in comparison studies, sequences that do not contain relatively long regions of similarity cannot be compared well. In other words, short sequences that do not extend to longer similarities are discarded as noise. Although this strategy is highly successful, it assumes that nonaligned short sequences are not important, which may not always be true. There may still be important differences undiscovered where alignments are not possible.
One approach to the second issue above is to develop alignment-free methods. The advantage of the alignment-free approach is that any collection of proteins can be compared quantitatively. Although various types of alignment-free approaches have been developed [24, 25], including our previous attempts to use membrane topology and a self-organizing map, the alignment-free approach in the present study is based on the “availability” (frequency bias) of short constituent sequences (SCSs) of amino acids (aa) in proteins [28, 29, 30, 31, 32, 33]. The length of SCSs can be 2 aa (doublet), 3 aa (triplet), 4 aa (quartet), 5 aa (pentat), and more in a given protein. This SCS-based analysis is basically similar to other related analyses for amino acid sequence patterns that have been described under different terms with slightly different mathematical operations: oligopeptide patterns [34, 35, 36, 37, 38, 39], amino acid sequence repertoire, peptide vocabulary,
In our approach, protein sequences are considered to be composed of many SCSs. Importantly, the number of possible SCSs is limited because a protein is composed of just 20 kinds of amino acids; there are 400 (=20²) permutations of 2-aa SCSs (doublets), 8000 (=20³) permutations of 3-aa SCSs (triplets), 160,000 (=20⁴) permutations of 4-aa SCSs (quartets), and 3,200,000 (=20⁵) permutations of 5-aa SCSs (pentats). Frequencies of individual SCSs in a given protein database can be inferred theoretically based on frequencies of component amino acids, which is called the expected frequency (
Using this simple concept of availability score, secondary structure characterization has been performed; SCS frequencies (and thus availability scores) are different among different secondary structures. Availability scores are also different between parallel and antiparallel β-strands. This approach is also relevant to identifying sequence motifs in some, although not all, proteins. It has been shown that triplet compositions in proteomes may reflect phylogenetic relationships [32, 37, 53]. We believe that this approach is applicable to understanding species specificity.
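The counting step behind this concept can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two toy sequences are made up, the function names are hypothetical, and the expected count assumes that residues occur independently at their overall database frequencies. The exact availability score formula used by the SCS Package is not reproduced here.

```python
from collections import Counter
from math import prod

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def count_scs(proteins, k):
    """Count every overlapping k-aa SCS across a list of protein sequences."""
    counts = Counter()
    for seq in proteins:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def expected_count(scs, aa_freq, total_windows):
    """Expected count of an SCS, assuming independent residues (an assumption)."""
    return total_windows * prod(aa_freq[a] for a in scs)


# Toy "database" of two made-up sequences, for illustration only.
proteins = ["MWRWSHAAWRWSH", "MAAAWRAAA"]
k = 5

observed = count_scs(proteins, k)
total_windows = sum(observed.values())

# Amino acid frequencies estimated from the same toy database.
residues = Counter("".join(proteins))
n_residues = sum(residues.values())
aa_freq = {a: residues[a] / n_residues for a in AMINO_ACIDS}

n_possible = len(AMINO_ACIDS) ** k  # number of possible pentats
wrwsh_observed = observed["WRWSH"]
wrwsh_expected = expected_count("WRWSH", aa_freq, total_windows)

print(n_possible, wrwsh_observed, wrwsh_expected)
```

An SCS whose observed count is well above its expected count is over-represented in the database; how that bias is converted into a score is left to the definition given in the text.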
We have implemented several applications, collectively called the SCS Package, that examine protein sequences computationally. Among them, we have built an application for identifying species-specific SCSs. In the present study, we compared the human proteome and nine other mammalian proteomes based on availability analysis of 5-aa SCSs (pentats) to identify human-specific pentats. We hypothesized that a protein containing the identified human-specific pentat would be unique to humans and might have played a role in human evolution.
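The presence/absence part of this comparison can be illustrated with a minimal Python sketch. It assumes toy sequences and hypothetical function names, and it covers only the "occurs in the target species but in no other species" criterion; the availability-score ranking used in the actual analysis is not included.

```python
def pentat_set(proteins, k=5):
    """Return the set of all overlapping k-aa SCSs found in a proteome."""
    return {seq[i:i + k] for seq in proteins for i in range(len(seq) - k + 1)}


def species_specific_scs(target, others, k=5):
    """SCSs present in the target proteome and absent from every other proteome."""
    specific = pentat_set(target, k)
    for proteins in others.values():
        specific -= pentat_set(proteins, k)
    return specific


# Made-up toy proteomes differing by a single residue (H versus R).
human = ["AAWRWSHAA"]
others = {"chimpanzee": ["AAWRWSRAA"], "mouse": ["AASLQAQAA"]}

candidates = sorted(species_specific_scs(human, others))
print(candidates)
```

In this toy case, a single amino acid difference makes every pentat that spans the changed residue specific to the target proteome, which is why candidate pentats still need to be checked against the underlying protein records.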
2. Materials and methods
2.1 The SCS package
Assuming that small changes in amino acids in proteins (or corresponding nucleotide changes in DNA) contribute significantly to phenotypic differences between humans and chimpanzees, the concept of SCS-based methods is to detect small amino acid usage differences between species in an alignment-independent manner. The SCS package is an open web service containing six applications (plus the latest application, under development, to analyze idiom networks) for protein analyses (http://scspackage.ads.ie.u-ryukyu.ac.jp/). These applications run in reference to a copy of the nonredundant (nr) database of the NCBI (National Center for Biotechnology Information, U.S. National Library of Medicine, Bethesda) downloaded on August 20, 2015. Because T-cell receptors, B-cell receptors, and antibodies are produced by somatic recombination and hypermutation, protein records containing the following keywords in sequence names were excluded: anti, IgG, IgM, IgA, IgD, IgE, BCR, TCR, B-cell receptor, T-cell receptor, Ig, and immunoglobulin. A record was excluded only when a keyword matched completely, including spaces. Frequencies and availability scores for all possible
In this study, we focused on 5-aa SCSs (pentats). For multiple species comparison, the availability score difference,
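The keyword-based exclusion of immune receptor records described in this section can be sketched as follows. This is an assumption-laden illustration: "complete match, including spaces" is interpreted here as a case-sensitive match of the keyword as a whole space-delimited token (or token sequence) in the sequence name, and the function name and example record names are hypothetical.

```python
# Keywords listed in the text for excluding somatically rearranged proteins.
EXCLUDED_KEYWORDS = [
    "anti", "IgG", "IgM", "IgA", "IgD", "IgE", "BCR", "TCR",
    "B-cell receptor", "T-cell receptor", "Ig", "immunoglobulin",
]


def is_excluded(sequence_name):
    """True if any keyword occurs as a complete, space-delimited match.

    Padding with spaces lets keywords at the start or end of the name match.
    This matching rule is an interpretation, not the documented implementation.
    """
    padded = f" {sequence_name} "
    return any(f" {keyword} " in padded for keyword in EXCLUDED_KEYWORDS)


# Hypothetical record names, for illustration only.
names = [
    "IgG heavy chain",
    "T-cell receptor beta chain",
    "antifreeze protein",
    "signal recognition particle",
]
kept = [name for name in names if not is_excluded(name)]
print(kept)
```

Under this interpretation, a name such as "antifreeze protein" is retained because "anti" does not appear as a separate token.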
2.2 Bioinformatics web services
After identifying FAM75 using the SCS package, available information on FAM75 and FAM205A was gathered using various web sites. The location of FAM75/FAM205A on chromosomes and their single nucleotide polymorphism (SNP) variants were searched using Map Viewer (www.ncbi.nlm.nih.gov/mapview/) in the NCBI server. For various information on human transcripts, we referred to H-InvDB (www.h-invitational.jp/hinv/ahg-db/index_ja.jsp). This site provides curated information on gene structure, splicing variants, functional RNAs, protein functions, functional domains, intracellular distribution, metabolic pathways, three-dimensional structures, disease relationships, genetic polymorphism (SNPs, indels, microsatellites, and others), gene expression profiles, molecular evolutionary characters, protein–protein interactions, and gene families. Tissue-specific expression profiles were searched using H-ANGEL (http://www.h-invitational.jp/hinv/h-angel/wge_top.cgi?), a database for human gene expression profiles, in the H-InvDB server. Information on alternative splicing variants of the nonhuman primates and mouse was obtained from Map Viewer in NCBI. We referred to the latest annotations available at the time: chimpanzee (Annotation Release 103), western gorilla (Annotation Release 100), Sumatran orangutan (Annotation Release 102), and laboratory mouse (Annotation Release 106). We frequently used protein BLAST in the NCBI server for conventional similarity search (blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins) and performed multiple sequence alignments when necessary using MEGA7 (www.megasoftware.net/). In addition, the cDNA sequence of FAM75 was subjected to RegRNA analysis (regrna.mbc.nctu.edu.tw/html/about.html) to identify any possible sequence motifs in FAM75 mRNA.
To further examine SNP variants in human populations, we used dbSNP (www.ncbi.nlm.nih.gov/snp/) and the 1000 Genomes Project by IGSR (The International Genome Sample Resource) (www.internationalgenome.org). Protein expression was examined using The Human Protein Atlas (www.proteinatlas.org) [62, 63]. This site contains immunohistochemical data for various human tissues. For identification of transmembrane domains in FAM75 and FAM205A, SOSUI (harrier.nagahama-i-bio.ac.jp/sosui/) and TMHMM (www.cbs.dtu.dk/services/TMHMM/) were used. For the subcellular distributions of FAM75 and FAM205A, PSORT II Prediction (psort.hgc.jp/form2.html) was used. Pfam (pfam.xfam.org) was used for the identification of protein families. Two applications of the SCS Package, “sequence analysis based on availability scores of short constituent amino acid sequences” (scspackage.ads.ie.u-ryukyu.ac.jp/sequence-analysis.php) [32, 33] and “extraction of idiomatic connections between triplets in proteins” (scspackage.ads.ie.u-ryukyu.ac.jp/extraction-of-idiomatic-connections.php), were used to identify possible functional sites in FAM75. EMBOSS Pepwindow (emboss.sourceforge.net/index.html) was used for the Kyte-Doolittle hydropathy plot. These web sites were accessed mainly in 2017 and 2018 and were reconfirmed in 2019.
2.3 Human cDNA samples for tissue expression profiling
For the cDNA template, we purchased human MTC (multiple tissue cDNA) panels I and II (Takara Bio, Kusatsu, Shiga, Japan). The panels contain first-strand cDNA from polyA+ RNA and are free from genomic DNA. The amounts of cDNA are approximately 1.0 ng/μL and are normalized to four housekeeping genes, phospholipase A2, G3PDH (glyceraldehyde-3-phosphate dehydrogenase), β-actin, and α-tubulin, which makes it possible to compare expression levels among different tissues. Panels I and II together contain cDNA samples from the following 16 human tissues: heart, brain, placenta, lung, liver, skeletal muscle, kidney, pancreas, spleen, thymus, prostate, testis, ovary, small intestine without mucosal lining, colon with mucosal lining, and peripheral blood leukocyte. Each tissue sample was pooled from 1 to 550 Caucasians, and the testis sample was pooled from 45 Caucasians aged 14–64, according to the manufacturer’s specifications.
2.4 PCR primers
Based on the cDNA sequence of FAM75, we designed two sets of PCR primers for nested PCR using Primer-BLAST (www.ncbi.nlm.nih.gov/tools/primer-blast/). The first set was to amplify both FAM75 and FAM205A from the consensus region, and the second set was to amplify FAM205A from the region that is present only in FAM205A. For the first set, the first-round forward primer was 5′-TTACCAGGTACTGTCACTGAACAC-3′, and its paired reverse primer was 5′-TTCTGAAGCTAGACTCTGTAAGGC-3′. This first round of PCR was expected to amplify 1387 bp. The second-round (nested) forward primer was 5′-AGTTGTACAGACGTTGCAAAAGAG-3′, and its paired reverse primer was 5′-TTTCTGAAGCTAGACTCTGTAAGGC-3′. This second round (nested) PCR was expected to amplify 1097 bp.
For the second set, the first-round forward primer was 5′-ATATCCCTTATACATCTATGGCTCCATCTTC-3′, and its paired reverse primer was 5′-TTTTATTTCTGAAGCTAGACTCTGTAAGGC-3′. This round of PCR was expected to amplify 3608 bp. The second-round (nested) forward primer was 5′-GTATGTCTTTAGATCAGAGTCTGGAGTTTC-3′, and its paired reverse primer was 5′-TTTATTTCTGAAGCTAGACTCTGTAAGGCTG-3′. This round of PCR was expected to amplify 3206 bp.
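As a quick sanity check on the primer sequences listed above, their lengths and GC contents can be tabulated with a short Python sketch. The dictionary keys and the function name are hypothetical labels introduced here for illustration; the sequences are those given in the text, and no melting temperature model is assumed.

```python
# Primer sequences as listed in the text (5' to 3'); keys are illustrative labels.
PRIMERS = {
    "set1_round1_fwd": "TTACCAGGTACTGTCACTGAACAC",
    "set1_round1_rev": "TTCTGAAGCTAGACTCTGTAAGGC",
    "set1_nested_fwd": "AGTTGTACAGACGTTGCAAAAGAG",
    "set1_nested_rev": "TTTCTGAAGCTAGACTCTGTAAGGC",
    "set2_round1_fwd": "ATATCCCTTATACATCTATGGCTCCATCTTC",
    "set2_round1_rev": "TTTTATTTCTGAAGCTAGACTCTGTAAGGC",
    "set2_nested_fwd": "GTATGTCTTTAGATCAGAGTCTGGAGTTTC",
    "set2_nested_rev": "TTTATTTCTGAAGCTAGACTCTGTAAGGCTG",
}


def gc_percent(seq):
    """Percentage of G and C bases in a DNA sequence."""
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


for name, seq in PRIMERS.items():
    # Every primer should contain only the four DNA bases.
    assert set(seq) <= set("ACGT"), name
    print(f"{name}: {len(seq)} nt, GC {gc_percent(seq):.1f}%")
```

A check of this kind helps catch transcription errors (for example, a non-DNA character) before primers are ordered.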
2.5 PCR conditions
We used an Astec PC320 thermal cycler (Fukuoka, Japan) and Tks Gflex DNA polymerase (Takara Bio) for PCR. According to the manufacturer’s specifications, this DNA polymerase has high fidelity; the error rate was reported to be 0.0131%.
The original cDNA sample from the human MTC panels (Takara Bio) was diluted 10 times to make PCR template samples. The following solutions were mixed to start PCR: Gflex PCR buffer 12.5 μL, deionized water 8.5 μL, DNA polymerase 0.5 μL, forward primer 0.5 μL, reverse primer 0.5 μL, and cDNA template 2.5 μL in a total amount of 25.0 μL. The nested PCR was performed using 2.5 μL reaction solution from the first-round PCR. In both the first and second (nested) rounds, a negative control was performed using deionized water without template cDNA.
The first PCR cycles were performed as follows: an initial denaturing step at 94°C (5 min); 10 cycles of 98°C (30 s), 60°C (30 s; −0.5°C/cycle), and 68°C (1 min); 30 cycles of 98°C (30 s), 55°C (30 s), and 68°C (1 min); and a final extension step at 68°C (30 s). The second (nested) PCR cycles were the same as the first PCR cycles except that the initial denaturing step at 94°C lasted 1 min. In both the first and second PCRs, the annealing temperature in the first 10 cycles was reduced stepwise (i.e., touch-down PCR): 60.0°C in the first cycle, 59.5°C in the second cycle, 59.0°C in the third cycle, and so on.
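The annealing-temperature schedule implied by this touch-down protocol can be written out explicitly. The following Python sketch uses the values stated above; the function and parameter names are hypothetical.

```python
def touchdown_annealing_schedule(start=60.0, step=0.5, touchdown_cycles=10,
                                 plateau=55.0, plateau_cycles=30):
    """Annealing temperature (in degrees C) for each cycle of the program."""
    # Touch-down phase: the temperature drops by `step` after every cycle.
    schedule = [start - step * cycle for cycle in range(touchdown_cycles)]
    # Plateau phase: a fixed annealing temperature for the remaining cycles.
    schedule += [plateau] * plateau_cycles
    return schedule


schedule = touchdown_annealing_schedule()
print(schedule[:10])   # touch-down phase
print(len(schedule))   # total number of cycles
```

With these values, the tenth touch-down cycle anneals at 55.5°C, so the transition to the 55°C plateau continues the same 0.5°C step.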
Positive controls were performed with the G3PDH primers supplied by the manufacturer in the Human MTC Panels (Takara Bio). The PCR product was expected to be 938 bp. The primer sequences were as follows: 5′-TGAAGGTCGGAGTCAACGGATTTGGT-3′ for the forward primer and 5′-CATGTGGGCCATGAGGTCCACCAC-3′ for the paired reverse primer. PCR cycles were as follows: an initial denaturing step at 95°C (1 min); 38 cycles of 95°C (30 s) and 68°C (3 min); and a final extension step at 68°C (3 min).
PCR products (1.0 μL) were subjected to 0.8% agarose gel electrophoresis in TAE buffer and stained with ethidium bromide for visualization. The PCR products were run with λHindIII DNA size marker (New England Biolabs, Ipswich, MA, USA).
3.1 Identifying candidate human-specific pentats
Availability scores (
3.2 Human proteins containing WRWSH
Human proteins containing WRWSH were identified using the “search for amino acid sequences of species” program, one of the SCS Package programs. Among all 148 hits, 16 hits were related to “mucin-19-like isoform,” 55 hits to “glycine-rich cell wall structural protein,” 28 hits to “RNA-binding protein,” 48 hits to “uncharacterized transmembrane protein,” and 1 hit to “unnamed protein product.” Unfortunately, these sequences except the last one, “unnamed protein product,” were all “predicted” informatically as parts of “
3.3 FAM75 and its related FAM205A
According to the NCBI record, FAM75 is a protein containing 1014 aa, and its cDNA coding sequence is 3274 bp (Accession No. AK125949.1). It is important to stress that FAM75 was identified as cDNA in the NEDO human cDNA sequencing project (www.nite.go.jp/en/nbrc/genome/project/annotation/cdna.html), and thus this protein is unlikely to be an erroneous product of genome sequencing. A protein BLAST search using FAM75 as a query identified the record “protein FAM205A [
3.4 Gene structures: alternative splicing and polymorphism
A UniGene search revealed that the FAM75/FAM205A gene was located at 9p13.3 on chromosome 9 in the human genome. As expected, their exon-intron structures were different (Figure 1). FAM75 had a single exon, whereas FAM205A had four exons. The exon of FAM75 had high homology with the fourth exon of FAM205A. The 5′-UTR of FAM75 also corresponded to the fourth exon of FAM205A. Clearly, these two RNA transcripts and their proteins are products of alternative splicing from the same genomic locus.
H-InvDB revealed two additional splicing variants (HIT000496944 and HIT000496575) from the same locus at 9p13.3 (Figure 1). The record HIT000496944 in the NCBI database was “
Because not all RNA transcripts are translated into proteins, we used a RegRNA search of UTRs (untranslated regions) to examine the integrity of the FAM75 mRNA. The RegRNA search revealed that the 5′-UTR of FAM75 had an internal ribosome entry site (IRES) [72, 73, 74] among other motifs, suggesting that the FAM75 mRNA is likely translated into protein.
3.5 WRWSH and WRWSR in human populations
The G/A difference in FAM75/FAM205A in genomic DNA (corresponding to the R/H difference between WRWSR and WRWSH) was confirmed to be a SNP in humans, according to dbSNP. We found that this SNP was widespread in human populations, and the G/A ratio was dependent on regional populations, as revealed by the 1000 Genomes Project (Figure 2). Among human populations, African populations had a high G frequency (i.e., WRWSR); the three highest G-frequency populations were Gambian in Western Division (96.16%); Yoruba in Ibadan, Nigeria (96.30%); and Mende in Sierra Leone (93.53%). In contrast, Asian and European populations had relatively high A frequency (i.e., WRWSH); the three highest A-frequency populations were Chinese Dai in Xishuangbanna (70.43%); Kinh in Ho Chi Minh City, Vietnam (67.17%); and Han Chinese South, China (61.43%).
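The correspondence between the G/A SNP and the R/H amino acid difference follows from the standard genetic code: the histidine codons CAT and CAC become the arginine codons CGT and CGC when the second base changes from A to G. The Python sketch below illustrates this. The actual codon used in FAM75 is not stated in the text, so both histidine codons are shown, and the alleles are assumed to be reported on the coding strand.

```python
# Subset of the standard genetic code relevant to the H/R difference.
CODON_TABLE = {"CAT": "H", "CAC": "H", "CGT": "R", "CGC": "R"}


def substitute(codon, position, base):
    """Return the codon with one base replaced (position is 0-based)."""
    return codon[:position] + base + codon[position + 1:]


changes = {}
for his_codon in ("CAT", "CAC"):
    # A single A-to-G substitution at the second codon position.
    variant = substitute(his_codon, 1, "G")
    changes[his_codon] = (variant, CODON_TABLE[variant])
    print(f"{his_codon} ({CODON_TABLE[his_codon]}) -> {variant} ({CODON_TABLE[variant]})")
```

This is consistent with the A allele encoding WRWSH and the G allele encoding WRWSR.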
3.6 Homologous proteins and alternative splicing products in other animals
Here, we searched for homologous proteins of FAM75 and FAM205A in other animals. Among the nine mammals used for the initial identification of WRWSH, homologous proteins of FAM205A were identified by BLAST search (Table 2); all proteins were FAM205A homologs, and all proteins in primates (chimpanzee, gorilla, and orangutan) contained WRWSR (corresponding to FAM205A) but not WRWSH (corresponding to FAM75), suggesting that WRWSH in FAM75 may be unique to humans among primates. In the nonprimate animals examined here, this pentat sequence was either not conserved at all or nonexistent.
|Species|Name in FASTA file|ID|Pentat|
|---|---|---|---|
|Gorilla|protein FAM205A like|XP_004048025.1|WRWSR|
|Orangutan|protein FAM205A isoform|XP_009242592.1|WRWSR|
|Mouse|predicted gene 12429 isoform|XP_011248363.1||
|Rat|protein FAM205-A isoform|XP_008774156.1||
|Opossum|protein FAM205-A like isoform|XP_007498908.1||
To further examine whether splicing variants exist in other great apes, we checked the genome loci and transcript data using Map Viewer. In chimpanzees (Figure 3) and gorillas (not shown), there were no alternative splicing transcripts from this locus. In orangutans (not shown), there were three isoforms, the X1, X2, and X3 transcripts, from this locus. However, these transcripts were very similar to one another, and they were all considered FAM205A homologs containing WRWSR. We also examined the genome of the mouse as a representative nonprimate mammal (not shown). There were three transcript variants: “predicted gene 12429 isoform X1, X2” and “predicted gene 12429.” They all contained SLQAQ instead of WRWSH in these proteins, and their splicing patterns were different from those of FAM75. We confirmed that human splicing patterns (Figure 4) were different from those of these mammals. Therefore, we conclude that the FAM75 transcript was found only in humans.
3.7 Testis-specific expression of FAM75 and FAM205A
To confirm the existence and expression of these transcripts in our laboratory, we performed RT-PCR (reverse transcription polymerase chain reaction) with two sets of PCR primers, using 16 different human-tissue cDNA pools as templates. The first set of primers was designed to amplify both FAM75 and FAM205A (Figure 5A), and the second set was designed to amplify FAM205A only (Figure 5B). Because the two transcripts overlap, exclusive amplification of FAM75 was not possible. With both primer sets, testis-specific expression was observed. A positive control using a primer set for G3PDH showed amplification from all tissues (Figure 5C), and a negative control (without cDNA template but with experimental primer sets) did not show any amplification.
Our results were consistent with the H-ANGEL expression database; in this database, FAM75 and FAM205A were not differentiated, but the database indicated that the expression was testis-specific (Figure 6). The NCBI database also indicated the testis-specific expression of FAM205A (not shown). The same expression pattern of FAM205A was found in the Human Protein Atlas, in which FAM205A was expressed in testis and in no other tissues examined at the mRNA level (not shown), confirming our PCR-based data. According to the Human Protein Atlas, cells in the seminiferous ducts (sperm and immature sperm cells) of the testis were clearly stained immunohistochemically, but Leydig cells were not (Figure 7). As mentioned in the Human Protein Atlas, staining was clearly detected in acrosomes in spermatids (Figure 7). Considering that the antibody used in the Human Protein Atlas could not differentiate FAM205A and FAM75 (because a recombinant C-terminal 104 aa fragment that is almost identical in both FAM205A and FAM75 was used as an antigen), both proteins were likely stained in the tissue sections.
3.8 Structural and functional predictions
We performed several sequence analyses to characterize the sequences of FAM75 (Figure 8). When FAM75 and FAM205A were subjected to SOSUI, the former was predicted as a soluble protein, but the latter was predicted as a membrane protein with a single transmembrane helix. TMHMM also showed essentially the same results. Indeed, the Human Protein Atlas considered FAM205A to be both a membrane protein and a cytoplasmic protein based on immunohistochemical results, suggesting that FAM75 and FAM205A may be detected in the cytoplasm and in membranes, respectively, as predicted by SOSUI and TMHMM. In contrast, both were predicted to be “nuclear” using PSORT II Prediction.
To search for possible functional sites, different amino acids between FAM75 and FAM205A (Table 1), conserved amino acids among FAM205A and similar sequences (top 100 BLAST data) based on multiple alignment, Pfam domain data, a hydropathy plot, an availability plot, and an idiom plot were aligned together (Figure 8). Conserved amino acids were located mostly in the N-terminal side, in which the “FAM75 domain” identified by Pfam was also located. WRWSH was located at the center of the hydrophilic region in the C-terminal side and corresponded to a high availability region and a high idiom region, although their significance was not clear at this point.
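To illustrate why WRWSH falls in a hydrophilic region, the mean Kyte-Doolittle hydropathy of the two pentat variants can be computed directly. The Python sketch below uses the published Kyte-Doolittle scale; the function names and the window length argument are hypothetical, and the sketch is not the EMBOSS Pepwindow implementation used for Figure 8.

```python
# Kyte-Doolittle hydropathy scale (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(seq):
    """Average Kyte-Doolittle value over a sequence."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def hydropathy_profile(seq, window):
    """Sliding-window mean hydropathy, one value per window position."""
    return [mean_hydropathy(seq[i:i + window])
            for i in range(len(seq) - window + 1)]


for pentat in ("WRWSH", "WRWSR"):
    print(pentat, round(mean_hydropathy(pentat), 2))
```

Both variants have clearly negative mean values, that is, both are hydrophilic on this scale, and the arginine variant is slightly more so.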
In this paper, we identified a WRWSH-containing protein, FAM75, as a candidate human-specific protein. We assumed that pentats with high availability scores in humans and no occurrence (
The present study showed that the SCS-based approach is a relevant addition to a list of practical sequence comparison methods. As with other methods, the SCS-based method is influenced by SNPs, accuracy, and the amount of information in databases. For example, the human genome has numerous SNP variations, and there is much less genomic information for other primates than for humans.
FAM75 and FAM205A appear to be alternative splicing products from the same genomic locus in humans (Figure 1). The relationship between the evolutionary invention of FAM75 as an alternative splicing product and that of a SNP variant for WRWSH is unclear. We cannot exclude the possibility that this may be a simple coincidence, but this coincidence is in accordance with our starting hypothesis for this study: proteins containing a human-specific pentat may indeed be human-specific as proteins. We confirmed the expression of FAM205A and/or FAM75 at the mRNA level in human tissues (Figure 5). At the protein level, the FAM205A protein (and probably also the FAM75 protein) was shown to be located in cells in seminiferous ducts and in acrosomes in spermatids in the testis (Figure 7). Interestingly, FAM205A was also detected in the human sperm nucleus in a proteomic study. Although it is difficult to distinguish FAM75 and FAM205A at the mRNA and protein levels, these results demonstrate that the FAM75/FAM205A gene is not a pseudogene and that protein products are actively produced in testis. The discovery of the IRES element in FAM75 mRNA also supports the idea that FAM75 mRNA is actively translated into protein. On the other hand, we found two additional alternative splicing products in H-InvDB (Figure 1). These additional mRNAs were not examined in this paper because of insufficient information. However, their status is of interest if they really exist; they may have similar but slightly different functions from FAM205A and FAM75.
Mechanistically, alternative splicing may be a relatively easy way to create a new protein sequence. It may be considered not only a “regulatory change” (according to the regulatory hypothesis, because the evolutionary invention of a new alternative splicing product conserves the original protein-coding DNA sequence and gene function and thus is more conservative with respect to species evolution) but also a “sequence change” (according to the constituent hypothesis, because the protein sequence is changed). These two modes are likely intermingled in this case. To extrapolate this argument, transcriptome studies of alternative splicing or RNA processing may be fruitful to identify human-specific genes. The present discovery of the IRES element in the FAM75 mRNA may be surprising because IRES elements are mostly viral, and cellular elements are relatively rare [72, 73, 74]. A search for IRES elements in the genome may also be fruitful.
The evolution of WRWSH and FAM75 in relation to human speciation is an important but uncertain aspect to be discussed. There are two kinds of “human-specific” proteins. First, a group of proteins may have been involved in the early step of speciation of
What is the function of FAM75 in human testes? According to the results of immunohistochemistry (the Human Protein Atlas), SOSUI, and TMHMM, we speculate that FAM75 appears to function differently from FAM205A in different cellular sites. Because FAM75 is likely located in acrosomes (Figure 7), this protein may be involved in the process of fertilization. A possibility is that FAM75 confers human specificity to prevent cross-species fertilization with ancestral species. The FAM75/FAM205A genomic locus in humans has an additional two alternative splicing products, which were not pursued in the present study, and orangutans and mice appeared to have three transcripts from the same locus. It is tempting to speculate that this locus partly contributes to speciation in primates and other mammals by restricting cross-species fertilization in ancestral species.
Molecularly, the main function of FAM75 may be located in the “FAM75 domain” located at the N-terminal side of the molecule (Figure 8), but because WRWSH is located in a hydrophilic region at the C-terminal side of the molecule, this hydrophilic region may function in human specificity. Indeed, the conserved regions are mostly located at the N-terminal side, probably for the general function of FAM75. The hydrophilic region also coincides with high availability and idiom-cluster regions.
Testis is known to be the fastest-evolving tissue based on gene expression comparisons in mammals, including the great apes [13, 14, 16, 17, 18, 79]. This flexibility may reflect diverse species-specific sexual behaviors. Mating is nonselective and frequent in chimpanzees, whereas only the highest-ranked male can mate in gorillas. These behaviors have been thought to be related to testis-size differences; the chimpanzee has relatively large testes, and the gorilla has small ones. Human testis size lies between these extremes, which may be related to the molecular evolution of FAM75 to modulate sperm development in testes or to withstand moderate sperm competition.
A recent finding that the gene locus for FAM205A is a susceptibility locus for intracerebral hemorrhage (ICH) is somewhat surprising. Either FAM205A or FAM75 may be expressed in cerebral cells at low levels or in restricted regions of the brain. It is tempting to speculate that a pleiotropic protein for both fertilization and brain development, such as FAM75/FAM205A, might have played a role in human evolution. The fact that the FAM205A/FAM75 gene is located not on a sex chromosome but on chromosome 9, despite its expression in the testis, might further suggest its dual role in sexual and nonsexual aspects of human specificity.
Our SCS-based approach identified FAM75, a WRWSH-containing protein, as a candidate human-specific protein. Its uniqueness in humans may have been acquired not only by a point mutation for WRWSH but also by novel alternative splicing. Together with FAM205A, FAM75 is likely expressed in human testis, and its possible expression in acrosomes suggests its potential function in fertilization and thus in human speciation. Its potential pleiotropic function in the brain is of considerable interest and warrants future investigation.
We thank Miki Kawauchi, Motosuke Tsutsumi, Hideka Konno, and other members of the BCPH Unit of Molecular Physiology for technical assistance and discussions. This work was supported by the Sekisui Chemical Grant Program for Research to JMO. This work was also supported by basic funds to JMO and MN from the University of the Ryukyus.
Conflict of interest
The authors declare no competing interests.